[00:00:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:33] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[00:01:43] RECOVERY - Host wdqs1013 is UP: PING WARNING - Packet loss = 33%, RTA = 0.33 ms
[00:49:03] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 14411328048 and 1062 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:32:50] (PS1) Jforrester: sqldump: Don't use wfGetLB(), we're killing it off [puppet] - https://gerrit.wikimedia.org/r/698649
[02:07:48] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.9 [core] (wmf/1.37.0-wmf.9) - https://gerrit.wikimedia.org/r/698652
[02:07:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.9 [core] (wmf/1.37.0-wmf.9) - https://gerrit.wikimedia.org/r/698652 (owner: TrainBranchBot)
[02:29:32] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.9 [core] (wmf/1.37.0-wmf.9) - https://gerrit.wikimedia.org/r/698652 (owner: TrainBranchBot)
[02:34:05] !log [WDQS] `ryankemper@wdqs1005:~$ sudo depool` (catching up on ~7h of lag)
[02:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:40] !log T284445 after manually stopping blazegraph/wdqs-updater, `sudo rm -fv /srv/wdqs/wikidata.jnl` on `wdqs1012` (clearing old
overinflated journal file away before xferring new one)
[02:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:44] T284445: Blazegraph journal too large on wdqs1012 - https://phabricator.wikimedia.org/T284445
[02:38:40] !log T284445 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1012.eqiad.wmnet --reason "repairing overinflated blazegraph journal" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs`
[02:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:23] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:18:09] (PS1) Samwilson: Enable Wikisource OCR on select Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898)
[03:19:17] (CR) jerkins-bot: [V: -1] Enable Wikisource OCR on select Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898) (owner: Samwilson)
[03:21:08] (PS2) Samwilson: Enable Wikisource OCR on select Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898)
[03:48:03] PROBLEM - Host mw1334 is DOWN: PING CRITICAL - Packet loss = 100%
[03:49:27] RECOVERY - Host mw1334 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[04:00:25] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:07:47] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:07] (PS1) Marostegui: install_server: Do not format /srv on dbstore1007 [puppet] - https://gerrit.wikimedia.org/r/698656 (https://phabricator.wikimedia.org/T283125)
[04:32:39] (PS2) Marostegui: install_server: Do not format /srv on dbstore1007 [puppet] - https://gerrit.wikimedia.org/r/698656 (https://phabricator.wikimedia.org/T283125)
[04:39:50] (CR) ArielGlenn: [C: +1] "Thumbs up from me in that case." [puppet] - https://gerrit.wikimedia.org/r/698636 (owner: Bstorm)
[04:47:07] (CR) Marostegui: [C: +2] install_server: Do not format /srv on dbstore1007 [puppet] - https://gerrit.wikimedia.org/r/698656 (https://phabricator.wikimedia.org/T283125) (owner: Marostegui)
[04:50:23] (CR) Marostegui: "> Patch Set 8:" [puppet] - https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: Andrew Bogott)
[04:52:00] (PS1) Marostegui: dbproxy1018: Repool clouddb1019:3314 [puppet] - https://gerrit.wikimedia.org/r/698657
[04:53:35] (CR) Marostegui: [C: +2] dbproxy1018: Repool clouddb1019:3314 [puppet] - https://gerrit.wikimedia.org/r/698657 (owner: Marostegui)
[04:54:29] !log Repool clouddb1019:3314
[04:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:21] (PS1) Marostegui: db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/698658 (https://phabricator.wikimedia.org/T283235)
[04:59:09] (CR) Marostegui: [C: +2] db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/698658 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[05:09:58] (PS1) Marostegui: install_server: Reimage db2123 to buster [puppet] - https://gerrit.wikimedia.org/r/698659 (https://phabricator.wikimedia.org/T283235)
[05:15:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE
[05:15:45] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:48] (CR) Marostegui: [C: +2] install_server: Reimage db2123 to buster [puppet] - https://gerrit.wikimedia.org/r/698659 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[05:17:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE
[05:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE
[05:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE
[05:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:08] (PS1) Marostegui: Revert "db2113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/698669
[05:51:00] (CR) Marostegui: [C: +2] Revert "db2113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/698669 (owner: Marostegui)
[05:59:55] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add ca-bundle to chart [deployment-charts] - https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) (owner: Giuseppe Lavagetto)
[06:02:10] (Merged) jenkins-bot: mediawiki: add ca-bundle to chart [deployment-charts] - https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) (owner: Giuseppe Lavagetto)
[06:27:04] !log clean some airflow logs on an-airflow1001 as one off to free space (had a chat with the Search team first)
[06:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:04] SRE, ops-eqiad, DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T283518 (ayounsi) Resolved→Open ` 3898
duplicate cable label (site eqiad) 3898 duplicate cable label (site eqiad) 3899 duplicate cable label (site eqiad) 3899 duplicate cable label (site eqiad) 2574 duplicate cable...
[06:40:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161 for upgrade', diff saved to https://phabricator.wikimedia.org/P16320 and previous config saved to /var/cache/conftool/dbconfig/20210608-064055-marostegui.json
[06:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16321 and previous config saved to /var/cache/conftool/dbconfig/20210608-064426-root.json
[06:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:07] (PS1) Volans: admin: add newly added people to LDAP/wmf [puppet] - https://gerrit.wikimedia.org/r/698714 (https://phabricator.wikimedia.org/T284437)
[06:52:41] !log T283606: running mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki={ar,bn,cs,vi}wiki --verbose --search-index with gerrit:696307 applied
[06:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:46] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606
[06:55:54] (PS1) Marostegui: install_server: Reimage db1130 to Buster [puppet] - https://gerrit.wikimedia.org/r/698715 (https://phabricator.wikimedia.org/T283235)
[06:57:05] (CR) Marostegui: [C: +2] install_server: Reimage db1130 to Buster [puppet] - https://gerrit.wikimedia.org/r/698715 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[06:59:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16322 and previous config saved to
/var/cache/conftool/dbconfig/20210608-065930-root.json
[06:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] (PS1) Muehlenhoff: Record extended MOU date for aarora [puppet] - https://gerrit.wikimedia.org/r/698716
[07:00:28] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/698716 (owner: Muehlenhoff)
[07:01:39] (CR) Muehlenhoff: [C: +2] Record extended MOU date for aarora [puppet] - https://gerrit.wikimedia.org/r/698716 (owner: Muehlenhoff)
[07:03:21] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/698714 (https://phabricator.wikimedia.org/T284437) (owner: Volans)
[07:07:22] (PS2) Volans: admin: add newly added people to LDAP/wmf [puppet] - https://gerrit.wikimedia.org/r/698714 (https://phabricator.wikimedia.org/T284437)
[07:08:44] (PS1) Marostegui: realm.pp: Add ldap_domains table to the private list [puppet] - https://gerrit.wikimedia.org/r/698718 (https://phabricator.wikimedia.org/T284106)
[07:10:06] I see almost all jobs in zuul as queued and just one of the tests running
[07:10:08] (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/698485 (owner: Muehlenhoff)
[07:10:26] is there any known problem?
[07:10:34] (PS2) Muehlenhoff: role::dumps::distribution::server: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/698485
[07:11:06] (CR) Marostegui: "This requires a mysql restart on all sanitarium hosts:" [puppet] - https://gerrit.wikimedia.org/r/698718 (https://phabricator.wikimedia.org/T284106) (owner: Marostegui)
[07:13:05] (CR) Ayounsi: [C: +1] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[07:14:03] (CR) Volans: [C: +2] admin: add newly added people to LDAP/wmf [puppet] - https://gerrit.wikimedia.org/r/698714 (https://phabricator.wikimedia.org/T284437) (owner: Volans)
[07:14:33] volans: looks to be quite a few patches running
[07:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16323 and previous config saved to /var/cache/conftool/dbconfig/20210608-071433-root.json
[07:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:59] (PS1) Giuseppe Lavagetto: mediawiki: fix main_app.port [deployment-charts] - https://gerrit.wikimedia.org/r/698719
[07:15:09] (CR) jerkins-bot: [V: -1] mediawiki: fix main_app.port [deployment-charts] - https://gerrit.wikimedia.org/r/698719 (owner: Giuseppe Lavagetto)
[07:15:22] https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10&from=now-15m&to=now
[07:15:28] 40 running / 100 waiting
[07:15:54] (CR) Ayounsi: "2 PCC warnings:" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[07:19:50] (PS2) Giuseppe Lavagetto: mediawiki: fix main_app.port [deployment-charts] - https://gerrit.wikimedia.org/r/698719
[07:20:13] SRE, LDAP-Access-Requests, Patch-For-Review: Access request to superset for user lzaman - https://phabricator.wikimedia.org/T284249 (Volans) @LZaman just to clarify. If you also have to access dashboards with private data then additional steps would be required. See https://wikitech.wikimedia.org/wik...
[07:21:51] (CR) Ayounsi: "From PCC:" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[07:22:34] (PS1) Jelto: upgrade jelto to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/698720
[07:25:13] (PS3) Itamar Givon: Set Wikidata's main sandbox item [mediawiki-config] - https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215)
[07:25:24] (CR) Itamar Givon: Set Wikidata's main sandbox item (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: Itamar Givon)
[07:25:40] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix main_app.port [deployment-charts] - https://gerrit.wikimedia.org/r/698719 (owner: Giuseppe Lavagetto)
[07:25:47] (PS4) Itamar Givon: Set Wikidata's main sandbox item [mediawiki-config] - https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215)
[07:28:10] (Merged) jenkins-bot: mediawiki: fix main_app.port [deployment-charts] - https://gerrit.wikimedia.org/r/698719 (owner: Giuseppe Lavagetto)
[07:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16324 and previous config saved to /var/cache/conftool/dbconfig/20210608-072937-root.json
[07:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:49] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:43] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:46] (CR) JMeybohm: [C: +1] upgrade jelto to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/698720 (owner: Jelto)
[07:37:56] (CR) Jelto: [C: +2] upgrade jelto to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/698720 (owner: Jelto)
[07:40:16] (CR) Jbond: [C: +2] P:idp::client::http:site: add support for same site cookie [puppet] - https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: Jbond)
[07:40:35] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:27] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[07:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:28] SRE, Traffic, netops, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (jbond)
[07:49:58] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2002.codfw.wmnet
[07:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:35] (CR) MMandere: prometheus: Add dependency between varnish exporter and varnish service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[07:55:44] SRE, Analytics, Analytics-Kanban, Product-Analytics, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (lucyblackwell) Approved!
[07:56:45] SRE, Analytics, Analytics-Kanban, Product-Analytics, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Volans)
[07:57:59] (CR) Ema: prometheus: Add dependency between varnish exporter and varnish service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[08:00:51] (PS1) Giuseppe Lavagetto: mediawiki: fix whitespace [deployment-charts] - https://gerrit.wikimedia.org/r/698721
[08:08:50] (CR) Cathal Mooney: [C: +1] "LGTM. Concept makes sense to me, it's a good idea to enable this." [homer/public] - https://gerrit.wikimedia.org/r/698512 (https://phabricator.wikimedia.org/T167306) (owner: Ayounsi)
[08:09:32] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix whitespace [deployment-charts] - https://gerrit.wikimedia.org/r/698721 (owner: Giuseppe Lavagetto)
[08:12:23] (Merged) jenkins-bot: mediawiki: fix whitespace [deployment-charts] - https://gerrit.wikimedia.org/r/698721 (owner: Giuseppe Lavagetto)
[08:13:28] !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[08:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:32] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:13] SRE, serviceops, User-jbond: Update docker-reporter to only check images available in the respective repos - https://phabricator.wikimedia.org/T284539 (jbond)
[08:19:30] (CR) Jbond: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[08:19:37] (CR) Jbond: [C: +2] docker-reporter: filter out old removed images [puppet] - https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[08:22:24] SRE, serviceops, User-jbond: Update docker-reporter to only check images available in the respective repos - https://phabricator.wikimedia.org/T284539 (jbond)
[08:22:30] SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond)
[08:22:48] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:29:05] !log restarting blazegraph on wdqs1006
[08:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:30] !log depooling wdqs1006 (lag)
[08:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:28] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold
[90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:44:33] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (ayounsi)
[08:48:28] (PS1) Muehlenhoff: Deploy the host keytab directly in the profile::base::cuminunpriv profile [puppet] - https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840)
[08:48:38] (CR) DCausse: "is this still needed?" [puppet] - https://gerrit.wikimedia.org/r/688309 (owner: ZPapierski)
[08:50:01] (CR) jerkins-bot: [V: -1] Deploy the host keytab directly in the profile::base::cuminunpriv profile [puppet] - https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[08:50:42] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (ayounsi)
[08:50:51] (PS9) Jbond: Add CAS authentication support [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[08:54:19] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (ayounsi)
[08:54:48] (PS2) Muehlenhoff: Deploy the host keytab directly in the profile::base::cuminunpriv profile [puppet] - https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840)
[08:56:17] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (ayounsi)
[08:56:47] (PS1) Elukey: Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones [dns] - https://gerrit.wikimedia.org/r/698729 (https://phabricator.wikimedia.org/T283125)
[08:57:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[08:59:45] (PS1) Giuseppe Lavagetto:
Bump the mediawiki chart [deployment-charts] - https://gerrit.wikimedia.org/r/698730
[09:00:03] (CR) Giuseppe Lavagetto: [C: +2] Bump the mediawiki chart [deployment-charts] - https://gerrit.wikimedia.org/r/698730 (owner: Giuseppe Lavagetto)
[09:01:41] (CR) ZPapierski: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/688309 (owner: ZPapierski)
[09:01:46] (Abandoned) ZPapierski: Push the limit for shads queried in relforge [puppet] - https://gerrit.wikimedia.org/r/688309 (owner: ZPapierski)
[09:02:34] (Merged) jenkins-bot: Bump the mediawiki chart [deployment-charts] - https://gerrit.wikimedia.org/r/698730 (owner: Giuseppe Lavagetto)
[09:04:16] !log removing docker-images from registry: releng/ci-jessie, releng/ci-src-setup, releng/composer-php56, releng/composer-test-php56, releng/npm, releng/npm-test, releng/npm-test-3d2png, releng/npm-test-graphoid, releng/npm-test-librdkafka, releng/npm-test-maps-service, releng/php56, releng/quibble-jessie, releng/quibble-jessie-hhvm, releng/quibble-jessie-php56 - T251918
[09:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:21] T251918: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918
[09:08:21] SRE, Traffic, netops, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (fgiunchedi) In case it is useful: a lighter weight (but one that we have to maintain ourselves) solution for prometheus would be to use node-exporter's textfile collector...
[09:10:14] (PS10) Jbond: Add CAS authentication support [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[09:14:30] (PS2) Volans: Add schoenbaechler to analytics-privatedata-users, no ssh [puppet] - https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: Ottomata)
[09:20:10] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (ayounsi)
[09:20:49] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (ayounsi) a:ayounsi
[09:21:58] (CR) Muehlenhoff: [C: +2] ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: Jcrespo)
[09:22:29] (PS1) Filippo Giunchedi: icinga: update reading web Grafana alerts [puppet] - https://gerrit.wikimedia.org/r/698735 (https://phabricator.wikimedia.org/T281359)
[09:23:34] SRE: sshd warning on cache nodes: Deprecated option UsePrivilegeSeparation - https://phabricator.wikimedia.org/T245635 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/635288
[09:23:51] (CR) Volans: "Removed manuel's -2 as L3 was signed and got manager approval, see task. Ready for review." [puppet] - https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: Ottomata)
[09:24:28] (CR) Marostegui: "Thanks volans!"
[puppet] - https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: Ottomata)
[09:26:23] (CR) Muehlenhoff: [C: +2] role::dumps::distribution::server: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/698485 (owner: Muehlenhoff)
[09:26:52] (PS1) Giuseppe Lavagetto: mediawiki: fix the puppet ca inclusion [deployment-charts] - https://gerrit.wikimedia.org/r/698736
[09:29:47] SRE, Traffic, netops, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (cmooney) Thanks @ayounsi, That exaring project looks to be a fairly sensible approach alright, albeit fairly new. Might be worth testing out. I note our NS seem to use...
[09:30:15] SRE, Patch-For-Review: sshd stretch puppet support - https://phabricator.wikimedia.org/T170298 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete, the obsolete options have been removed from our sshd template.
[09:32:41] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (cmooney)
[09:33:09] (PS7) Jbond: IDM: create new idm library with logoutd base class [software/pywmflib] - https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242)
[09:35:58] (CR) jerkins-bot: [V: -1] IDM: create new idm library with logoutd base class [software/pywmflib] - https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: Jbond)
[09:39:41] (CR) Jbond: IDM: create new idm library with logoutd base class (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: Jbond)
[09:41:28] (PS8) Jbond: IDM: create new idm library with logoutd base class [software/pywmflib] - https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242)
[09:43:18] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (cmooney)
[09:46:43] (PS1) Volans: setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740
[09:47:00] jayme: ^^^
[09:47:38] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix the puppet ca inclusion [deployment-charts] - https://gerrit.wikimedia.org/r/698736 (owner: Giuseppe Lavagetto)
[09:48:08] volans: thanks
[09:49:09] (CR) Filippo Giunchedi: [C: +2] alertmanager: update dashboard minimum group width to 2048 [puppet] - https://gerrit.wikimedia.org/r/698507 (https://phabricator.wikimedia.org/T284213) (owner: Filippo Giunchedi)
[09:49:19] (CR) Filippo Giunchedi: [C: +2] alertmanager: print link separators on IRC when needed [puppet] - https://gerrit.wikimedia.org/r/698459 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[09:50:04] (Merged) jenkins-bot: mediawiki: fix the puppet ca inclusion [deployment-charts] -
https://gerrit.wikimedia.org/r/698736 (owner: Giuseppe Lavagetto)
[09:50:55] (CR) JMeybohm: [C: +2] setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[09:51:42] (CR) Jbond: Add CAS authentication support (1 comment) [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[09:52:44] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:25] (CR) jerkins-bot: [V: -1] setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[09:54:49] what's wrong jerkins
[09:55:39] SRE, netops: Increase Google IX sessions prefix-limit - https://phabricator.wikimedia.org/T284447 (cmooney) Open→Resolved
[09:55:43] ERROR: HTTP error 502 while getting from pythonhosted.org :/
[09:56:47] (CR) Muehlenhoff: [C: +1] prometheus: Add dependency between varnish exporter and varnish service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[09:57:57] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (4)
[09:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:51] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (4) (duration: 00m 54s)
[09:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:23] !log upgrade Routinator 3000 to 0.9.0 on rpki2001 - T282469
[10:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:27] T282469: routinator: create garbage collection job -
https://phabricator.wikimedia.org/T282469
[10:03:58] (CR) JMeybohm: [C: +2] "recheck" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[10:04:42] SRE, serviceops, User-jbond: Update docker-reporter to only check images available in the respective repos - https://phabricator.wikimedia.org/T284539 (jbond) p:Triage→Low
[10:05:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:05:46] (CR) jerkins-bot: [V: -1] setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[10:06:30] expected ^ it's booting up
[10:07:02] (CR) Muehlenhoff: [C: +1] "Looks good" [software/pywmflib] - https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: Jbond)
[10:08:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:16:33] !log testing upcoming Scap release on beta
[10:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:22] (CR) Muehlenhoff: [C: +1] "Looks good, one comment inline."
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [10:20:45] (03PS3) 10Volans: Add schoenbaechler to analytics-privatedata-users, no ssh [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [10:21:18] (03CR) 10jerkins-bot: [V: 04-1] Add schoenbaechler to analytics-privatedata-users, no ssh [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [10:24:25] PROBLEM - Alertmanager IRC relay is not connected on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=alertmanager prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:29:20] yes that's true ^ :( debugging [10:29:31] RECOVERY - Alertmanager IRC relay is not connected on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:29:55] yeah that's me restarting alertmanager-irc-relay, though still not connecting to libera [10:32:39] (03CR) 10Kormat: [C: 03+1] realm.pp: Add ldap_domains table to the private list [puppet] - 10https://gerrit.wikimedia.org/r/698718 (https://phabricator.wikimedia.org/T284106) (owner: 10Marostegui) [10:34:41] PROBLEM - Alertmanager IRC relay is not connected on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=alertmanager prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:38:05] RECOVERY - Alertmanager IRC relay is not connected on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:38:28] uh mmhh jinxer-wm got banned [10:38:57] I'll write to the contact email [10:43:13] PROBLEM - Alertmanager IRC relay is not connected on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=alertmanager prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:43:37] known ^ I'll ack [10:44:15] (03PS1) 10WMDE-Fisch: [beta] Enable new search features for the TemplateWizard dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698754 (https://phabricator.wikimedia.org/T271802) [10:45:13] (03PS11) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [10:45:19] (03PS1) 10Muehlenhoff: Enable install* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698755 [10:45:42] (03PS2) 10WMDE-Fisch: [beta] Enable new search features for the TemplateWizard dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698754 (https://phabricator.wikimedia.org/T271802) [10:46:41] (03PS1) 10Ema: cloud: add ATS trusted_ca_path hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/698756 (https://phabricator.wikimedia.org/T281673) [10:47:04] (03PS1) 10Kormat: Revert "db1157: Disable notifications." [puppet] - 10https://gerrit.wikimedia.org/r/698677 [10:47:35] (03PS2) 10Ema: cloud: add ATS trusted_ca_path hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/698756 (https://phabricator.wikimedia.org/T281673) [10:47:37] (03PS12) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [10:48:03] (03CR) 10Kormat: [C: 03+2] Revert "db1157: Disable notifications." 
[puppet] - 10https://gerrit.wikimedia.org/r/698677 (owner: 10Kormat) [10:49:16] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (4) [10:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:09] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (4) (duration: 00m 53s) [10:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698755 (owner: 10Muehlenhoff) [10:51:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "For deployment, sync InitialiseSettings.php first, then Wikibase.php." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: 10Itamar Givon) [10:53:07] (03CR) 10Phuedx: "Recheck." [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [10:53:47] !log kormat@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: reimaged to buster T283131', diff saved to https://phabricator.wikimedia.org/P16326 and previous config saved to /var/cache/conftool/dbconfig/20210608-105346-kormat.json [10:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:55] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 [10:55:29] jouncebot: now [10:55:29] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [10:55:33] jouncebot: next [10:55:33] In 0 hour(s) and 4 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1100) [10:56:18] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:46] (03PS1) 10Kormat: Revert "db-eqiad.php: Set pc1010 as pc2 primary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698678 [10:57:36] kormat: reminder: upgrade pc1008's kernel [10:57:52] marostegui: oh. fiiiine. [10:57:53] (03CR) 10Awight: [C: 03+2] [beta] Enable new search features for the TemplateWizard dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698754 (https://phabricator.wikimedia.org/T271802) (owner: 10WMDE-Fisch) [10:58:00] kormat: :) [10:58:28] marostegui: no kernel updates available, but there is a wmf-mariadb104 update. i'm assuming we want that. [10:58:42] (03Merged) 10jenkins-bot: [beta] Enable new search features for the TemplateWizard dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698754 (https://phabricator.wikimedia.org/T271802) (owner: 10WMDE-Fisch) [10:58:54] kormat: it was probably installed but the host wasn't rebooted [10:59:02] uptime 273 days, yaaay [10:59:02] ack [10:59:05] so yeah, needs the reboot [10:59:16] marostegui: and the upgrade to 10.4.19? [10:59:20] +1! [10:59:26] ok, 10.4.20 it is! [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1100). [11:00:05] Urbanecm and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:10] o/ [11:00:10] i can deploy today [11:00:11] o/ [11:00:20] ok [11:00:32] (03CR) 10Urbanecm: [C: 03+2] Pass context to compact_language_links.open hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698537 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [11:00:34] (03CR) 10Urbanecm: [C: 03+2] universalLanguageSelector: Add missing properties [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [11:00:53] (unless kormat wants to sync sth first? ) [11:01:03] urbanecm: no go ahead :) [11:01:07] ok ok :) [11:01:18] i'm about to lunch anyway. just wanted to make sure i wouldn't be conflicting later [11:01:45] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2008.codfw.wmnet,pc1008.eqiad.wmnet with reason: Rebooting pc1008 [11:01:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2008.codfw.wmnet,pc1008.eqiad.wmnet with reason: Rebooting pc1008 [11:01:47] enjoy your lunch then :) [11:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:52] (03PS2) 10Urbanecm: enwiki: Deploy Growth freatures to 2% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698555 (https://phabricator.wikimedia.org/T281896) [11:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:58] (03CR) 10Urbanecm: [C: 03+2] enwiki: Deploy Growth freatures to 2% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698555 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [11:02:49] (03Merged) 10jenkins-bot: enwiki: Deploy Growth freatures to 2% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698555 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [11:03:03] (03CR) 10Volans: IDM: create new idm library with logoutd base 
class (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [11:03:48] awight: please don't forget to git fetch at deployment host when you +2 a beta-only config patch. It always worries me a bit when I see more than one patch fetched :) Thanks! [11:03:48] (03CR) 10Volans: "recheck" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [11:05:42] (03CR) 10Urbanecm: [C: 03+2] lvwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191) (owner: 10Urbanecm) [11:05:52] (03CR) 10jerkins-bot: [V: 04-1] lvwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191) (owner: 10Urbanecm) [11:05:57] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: abd401074247d1f1dd2722c2d4d06747b066d547: enwiki: Deploy Growth freatures to 2% of new accounts (T281896) (duration: 00m 57s) [11:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] T281896: Deploy Growth features on English Wikipedia - https://phabricator.wikimedia.org/T281896 [11:06:08] * urbanecm fixing lvwiki config [11:07:00] (03PS4) 10Urbanecm: lvwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191) [11:07:09] (03CR) 10Urbanecm: [C: 03+2] lvwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191) (owner: 10Urbanecm) [11:07:48] (03CR) 10Urbanecm: [C: 03+2] universalLanguageSelector: Add missing properties [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) 
[11:08:08] (03Merged) 10jenkins-bot: lvwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191) (owner: 10Urbanecm) [11:08:12] (03CR) 10Volans: [C: 03+1] "LGTM, one comment inline for potential improvement" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [11:08:51] !log kormat@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: reimaged to buster T283131', diff saved to https://phabricator.wikimedia.org/P16327 and previous config saved to /var/cache/conftool/dbconfig/20210608-110850-kormat.json [11:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:56] (03PS1) 10Giuseppe Lavagetto: Add ca-certificates to the php images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/698758 (https://phabricator.wikimedia.org/T284417) [11:08:56] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 [11:09:08] (03CR) 10Volans: [C: 03+2] Add schoenbaechler to analytics-privatedata-users, no ssh [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [11:10:12] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add ca-certificates to the php images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/698758 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [11:10:29] !log mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=lvwiki growthexperiments # T278191 [11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:34] T278191: Deploy Growth experiments at Latvian Wikipedia - https://phabricator.wikimedia.org/T278191 [11:12:03] (03CR) 10Phuedx: [C: 03+1] icinga: update reading web Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/698735 
(https://phabricator.wikimedia.org/T281359) (owner: 10Filippo Giunchedi) [11:12:41] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 73dc708efc25caa667be516c685885db3983be73: lvwiki: Enable Growth features in dark mode (T278191; 1/3) (duration: 00m 57s) [11:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:54] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Volans) The patch has been merged and deployed, it will be effective within ~30 minutes from now. @schoen... [11:13:52] !log urbanecm@deploy1002 Synchronized wmf-config/config/lvwiki.yaml: 73dc708efc25caa667be516c685885db3983be73: lvwiki: Enable Growth features in dark mode (T278191; 2/3) (duration: 00m 56s) [11:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:59] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10hashar) +1 on the image that got deleted. Note that we might have some images that got moved from Jessie, Stretch, Buster but having kept their name. I don't... 
[11:15:11] (03CR) 10Muehlenhoff: Deploy the host keytab directly in the profile::base::cuminunpriv profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [11:15:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 73dc708efc25caa667be516c685885db3983be73: lvwiki: Enable Growth features in dark mode (T278191; 3/3) (duration: 00m 58s) [11:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:52] T278191: Deploy Growth experiments at Latvian Wikipedia - https://phabricator.wikimedia.org/T278191 [11:18:30] urbanecm: Thanks for the reminder to update *-labs files on deployment! [11:18:37] np :) [11:22:38] phuedx: I'm now waiting on CI to merge your patches [11:22:56] (03PS1) 10Jbond: P:pki::multirootca: Add puppet CA to the the pi web site [puppet] - 10https://gerrit.wikimedia.org/r/698760 [11:22:57] urbanecm: I'm watching CI intently [11:23:04] (y) [11:23:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: reimaged to buster T283131', diff saved to https://phabricator.wikimedia.org/P16328 and previous config saved to /var/cache/conftool/dbconfig/20210608-112354-kormat.json [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:59] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 [11:25:21] (03Merged) 10jenkins-bot: Pass context to compact_language_links.open hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698537 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [11:25:32] \o/ [11:25:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29816/console" [puppet] - 10https://gerrit.wikimedia.org/r/698760 (owner: 10Jbond) [11:26:39] finally [11:26:54] (03CR) 10Urbanecm: [C: 03+2] 
universalLanguageSelector: Add missing properties [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [11:27:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: Add puppet CA to the the pi web site [puppet] - 10https://gerrit.wikimedia.org/r/698760 (owner: 10Jbond) [11:27:22] phuedx: your patch is at mwdebug1001. Can you test, please? [11:27:32] urbanecm: Sure [11:30:14] (03PS1) 10Jbond: P:pki::multirootca: use source not content [puppet] - 10https://gerrit.wikimedia.org/r/698761 [11:30:23] urbanecm: 698537 LGTM [11:30:28] thanks, syncing [11:31:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/698755 (owner: 10Muehlenhoff) [11:31:59] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/UniversalLanguageSelector/resources/js/ext.uls.launch.js: 5df13eeae3b52b98eaf3fdb99ddfa5a0f7b2b1e4: Pass context to compact_language_links.open hook (T280770) (duration: 00m 57s) [11:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:03] T280770: Instrumentation QA for language switching - https://phabricator.wikimedia.org/T280770 [11:32:05] (03CR) 10Volans: [C: 03+1] "I guess that the missing secrets are expected" [puppet] - 10https://gerrit.wikimedia.org/r/698755 (owner: 10Muehlenhoff) [11:32:13] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: use source not content [puppet] - 10https://gerrit.wikimedia.org/r/698761 (owner: 10Jbond) [11:33:17] phuedx: pulled the second one to mwdebug1001 as well (698535) [11:35:44] urbanecm: LGTM thanks [11:35:48] thanks, syncing [11:37:19] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/WikimediaEvents/: b0b46530b731d2a5f17b0aa04a4cf99df175e23d: universalLanguageSelector: Add missing properties (T280770) (duration: 00m 56s) [11:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:25] T280770: Instrumentation 
QA for language switching - https://phabricator.wikimedia.org/T280770 [11:38:05] !log installing ruby-nokogiri security updates [11:38:06] phuedx: done [11:38:08] anything else? [11:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:24] Not from me, no [11:38:29] Thanks again, urbanecm [11:38:31] any time [11:38:36] (03PS1) 10Jbond: P:docker: update filter file [puppet] - 10https://gerrit.wikimedia.org/r/698763 [11:38:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: reimaged to buster T283131', diff saved to https://phabricator.wikimedia.org/P16329 and previous config saved to /var/cache/conftool/dbconfig/20210608-113857-kormat.json [11:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:03] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 [11:39:04] (03PS2) 10Jbond: P:docker: update filter file [puppet] - 10https://gerrit.wikimedia.org/r/698763 (https://phabricator.wikimedia.org/T251918) [11:39:31] !log EU B&C deployment done [11:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:52] (03CR) 10Jbond: [C: 03+2] P:docker: update filter file [puppet] - 10https://gerrit.wikimedia.org/r/698763 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [11:41:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud: add ATS trusted_ca_path hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/698756 (https://phabricator.wikimedia.org/T281673) (owner: 10Ema) [11:43:14] !log Start server-side upload for 2 files (T283645, T283583) [11:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] T283583: Server side upload for Lorax - https://phabricator.wikimedia.org/T283583 [11:43:21] T283645: Server side upload for Sturm - https://phabricator.wikimedia.org/T283645 [11:45:01] (03PS3) 10MMandere: prometheus: Add dependency between varnish exporter and 
varnish service [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) [11:45:17] !log installing nginx security updates on buster [11:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:42] !log Start server-side upload for 1 file (T283470) [11:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:47] T283470: Server side upload for Hedestad - https://phabricator.wikimedia.org/T283470 [11:47:18] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:17] (03CR) 10Urbanecm: [C: 04-1] enwiki: Remove 'collectionsaveascommunitypage' from the 'user' group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) (owner: 10MarcoAurelio) [11:48:19] (03PS2) 10JMeybohm: setup.py: change setuptools_scm tag regex [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698740 (owner: 10Volans) [11:48:21] (03PS1) 10JMeybohm: Fix out of registry on delete-tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698764 [11:48:23] (03PS1) 10JMeybohm: Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698765 [11:48:25] (03PS1) 10JMeybohm: Don't treat nonexisting image tags as failure on delete [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698766 [11:48:41] I'll do a quick patch anyway [11:48:45] (03PS3) 10Urbanecm: enwiki: Disable indexing on the Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698039 (https://phabricator.wikimedia.org/T283522) (owner: 10MarcoAurelio) [11:48:49] (03CR) 10Urbanecm: [C: 03+2] enwiki: Disable indexing on the Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698039 (https://phabricator.wikimedia.org/T283522) (owner: 10MarcoAurelio) 
[11:49:46] (03PS2) 10JMeybohm: Fix output of registry on delete-tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698764 [11:49:48] (03PS2) 10JMeybohm: Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698765 [11:49:50] (03PS2) 10JMeybohm: Don't treat nonexisting image tags as failure on delete [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698766 [11:51:00] (03Merged) 10jenkins-bot: enwiki: Disable indexing on the Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698039 (https://phabricator.wikimedia.org/T283522) (owner: 10MarcoAurelio) [11:51:48] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:52:21] (03CR) 10MMandere: prometheus: Add dependency between varnish exporter and varnish service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [11:52:56] (03PS1) 10Muehlenhoff: eventschemas: Switch to nginx profile [puppet] - 10https://gerrit.wikimedia.org/r/698767 (https://phabricator.wikimedia.org/T163356) [11:54:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ef49422b162ab0161bc39da857b3230175ac4492: enwiki: Disable indexing on the Book namespace (T283522) (duration: 00m 56s) [11:54:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698767 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [11:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:18] T283522: Disable indexing in the book namespace on enwiki - https://phabricator.wikimedia.org/T283522 [11:56:26] (03PS1) 10JMeybohm: Relase new version 0.0.12-1 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/698769 [12:00:44] (03PS1) 10Muehlenhoff: eventschemas: 
Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698771 (https://phabricator.wikimedia.org/T164456) [12:04:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698771 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:10:12] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: update reading web Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/698735 (https://phabricator.wikimedia.org/T281359) (owner: 10Filippo Giunchedi) [12:13:03] jouncebot: next [12:13:03] In 3 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1600) [12:13:17] (03CR) 10Kormat: [C: 03+2] Revert "db-eqiad.php: Set pc1010 as pc2 primary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698678 (owner: 10Kormat) [12:14:19] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Set pc1010 as pc2 primary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698678 (owner: 10Kormat) [12:14:26] !log setting pc1008 back as pc2 primary T282761 [12:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:31] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [12:15:51] !log kormat@deploy1002 Synchronized wmf-config/db-eqiad.php: Repool pc1008 as pc2 master T282761 (duration: 00m 57s) [12:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:44] (03CR) 10Ema: [C: 03+2] cloud: add ATS trusted_ca_path hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/698756 (https://phabricator.wikimedia.org/T281673) (owner: 10Ema) [12:18:20] arturo: thanks for the review :) [12:18:25] (03PS1) 10Muehlenhoff: Add dummy keytabs for install* [labs/private] - 10https://gerrit.wikimedia.org/r/698776 [12:18:29] <3 [12:21:51] (03PS1) 10Kormat: db-eqiad.php: Set pc1010 as pc3 primary. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/698777 (https://phabricator.wikimedia.org/T282761) [12:22:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:22:58] (03CR) 10Daimona Eaytoy: [C: 03+1] "Won't be there for deployment, unfortunately." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898) (owner: 10Samwilson) [12:24:34] (03PS1) 10Kormat: pc1010: Move to pc3. [puppet] - 10https://gerrit.wikimedia.org/r/698778 (https://phabricator.wikimedia.org/T282761) [12:29:15] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytabs for install* [labs/private] - 10https://gerrit.wikimedia.org/r/698776 (owner: 10Muehlenhoff) [12:30:08] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/698755 (owner: 10Muehlenhoff) [12:31:50] (03PS13) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [12:34:22] (03PS14) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [12:34:52] PROBLEM - MariaDB Replica Lag: pc2 on pc2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 492.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:38:07] ACKNOWLEDGEMENT - MariaDB Replica Lag: pc2 on pc2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 533.86 seconds Kormat Expected post-repooling lag. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:38:09] (03CR) 10Majavah: toolforge: Remove non-helm ingress-nginx files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698588 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [12:54:45] (03CR) 10Muehlenhoff: [C: 03+2] Deploy the host keytab directly in the profile::base::cuminunpriv profile [puppet] - 10https://gerrit.wikimedia.org/r/698726 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [12:59:10] (03CR) 10Subramanya Sastry: [C: 03+1] Switch scandium/testreduce to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:00:32] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:09:12] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:14] RECOVERY - Alertmanager IRC relay is not connected on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [13:14:31] (03CR) 10Ottomata: [C: 03+1] eventschemas: Switch to nginx profile [puppet] - 10https://gerrit.wikimedia.org/r/698767 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [13:14:51] (03CR) 10Ottomata: [C: 03+1] eventschemas: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698771 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:17:03] !issync [13:17:03] Syncing #wikimedia-operations (requested by legoktm) [13:17:07] Set /cs flags #wikimedia-operations wmopbot +Vv [13:17:07] Set /cs flags #wikimedia-operations jinxer-wm +Vv [13:17:10] Set /cs flags #wikimedia-operations stashbot +Vv [13:17:11] Set /cs flags #wikimedia-operations jouncebot +Vv [13:17:17] Set /cs flags #wikimedia-operations wikibugs +Vv [13:17:17] Set /cs flags #wikimedia-operations logmsgbot +Vv [13:17:17] Set /cs flags #wikimedia-operations wm-bot +Vv [13:17:19] Set /cs flags #wikimedia-operations icinga-wm +Vv [13:17:21] Set /cs flags #wikimedia-operations ircservserv-wm +V [13:17:45] it's maaagic [13:18:02] 10SRE, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10DNdubane_WMF) [13:18:24] 10SRE, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10DNdubane_WMF) 05Resolved→03Open [13:19:32] PROBLEM - Alertmanager IRC relay is not connected on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=alertmanager prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [13:19:52] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) 05Open→03Resolved Thanks @jbond for looking after this. 
I'll bluntly close this task again now. [13:22:45] !log otto@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. - otto@cumin1001 [13:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] (03PS15) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [13:29:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698755 (owner: 10Muehlenhoff) [13:33:09] !log otto@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. - otto@cumin1001 [13:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:46] 10SRE, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10DNdubane_WMF) I would like to request an update of my public key chain, as I was experiencing an error when I was trying to connect to the STAT6 machine. We figured that it mig... 
[13:34:40] 10SRE, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10elukey) I had a chat with @DNdubane_WMF over slack about this issue, and suggested to reopen the task :) [13:35:18] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [13:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:35] (03CR) 10Muehlenhoff: [C: 03+2] Enable install* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698755 (owner: 10Muehlenhoff) [13:36:21] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 01m 03s) [13:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:50] 10SRE, 10Traffic, 10VPS-project-Codesearch, 10netops, 10serviceops: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10BBlack) When we looked into this for the Bird-based anycast stuff, we found that the combination you w... [13:39:48] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [13:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:13] 10SRE, 10Traffic, 10VPS-project-Codesearch, 10serviceops: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (10ayounsi) [13:40:35] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 00m 47s) [13:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:52] Have any changes been deployed in the past few days that would affect how pages with ?action=raw would be returned? 
It seems that AutoWikiBrowser's updater has broken as a result of something. I'm testing hitting https://en.wikipedia.org/w/index.php?title=Wikipedia:AutoWikiBrowser/CheckPage/VersionJSON&action=raw with curl, and I'm not getting raw JSON, but instead I'm getting the full HTML of the page. [13:41:31] !log otto@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [13:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:11] I figured out the curl issue, but AWB is still broken. I might need to do some more investigation. [13:47:35] 10SRE, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10DNdubane_WMF) [13:48:43] !log otto@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [13:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:28] (03PS16) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [13:51:11] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [13:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:36] (03CR) 10Andrew Bogott: [C: 03+2] profile:mariadb:core: Hack in access from labwebs to s6 [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [13:51:54] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 00m 42s) [13:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:06] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:53:31] (03CR) 10Marostegui: [C: 03+1] db-eqiad.php: Set pc1010 as 
pc3 primary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698777 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [13:53:50] (03CR) 10Marostegui: [C: 03+1] pc1010: Move to pc3. [puppet] - 10https://gerrit.wikimedia.org/r/698778 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [13:54:31] (03PS1) 10Volans: admin: update dumisani's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/698790 (https://phabricator.wikimedia.org/T266791) [13:54:50] (03CR) 10Andrew Bogott: [C: 03+1] realm.pp: Add ldap_domains table to the private list [puppet] - 10https://gerrit.wikimedia.org/r/698718 (https://phabricator.wikimedia.org/T284106) (owner: 10Marostegui) [13:57:37] (03CR) 10Elukey: [C: 03+1] admin: update dumisani's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/698790 (https://phabricator.wikimedia.org/T266791) (owner: 10Volans) [13:57:44] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:08] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.53 ms [13:59:32] (03CR) 10Elukey: [C: 03+1] "key verified over Slack" [puppet] - 10https://gerrit.wikimedia.org/r/698790 (https://phabricator.wikimedia.org/T266791) (owner: 10Volans) [14:00:02] 10SRE, 10Traffic, 10HTTPS, 10Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (10jbond) RFC assigned [[ https://datatracker.ietf.org/doc/html/rfc9000 | rfc9000 ]] [14:00:09] (03CR) 10Volans: [C: 03+2] admin: update dumisani's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/698790 (https://phabricator.wikimedia.org/T266791) (owner: 10Volans) [14:00:50] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:42] (03PS1) 10Jbond: cas: add cas_configuration symlink [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 [14:02:16] (03CR) 10Bstorm: [C: 03+2] dumps 
distribution: uncomment sagres.c3sl.ufpr.br [puppet] - 10https://gerrit.wikimedia.org/r/698636 (owner: 10Bstorm) [14:02:53] jouncebot: next [14:02:54] In 1 hour(s) and 57 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1600) [14:03:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10Volans) New key merged. The change will be distributed within ~30 minutes from now on all affected hosts. @DNdubane_WMF Please ensure you can connect aft... [14:03:20] (03CR) 10Kormat: [C: 03+2] pc1010: Move to pc3. [puppet] - 10https://gerrit.wikimedia.org/r/698778 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [14:03:44] (03CR) 10Kormat: [C: 03+2] db-eqiad.php: Set pc1010 as pc3 primary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698777 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [14:04:26] (03Merged) 10jenkins-bot: db-eqiad.php: Set pc1010 as pc3 primary. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/698777 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [14:05:07] !log setting pc1010 as pc3 primary T282761 [14:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:12] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [14:05:36] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:50] !log kormat@deploy1002 Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc3 master T282761 (duration: 00m 57s) [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:22] (03CR) 10Marostegui: [C: 03+2] realm.pp: Add ldap_domains table to the private list [puppet] - 10https://gerrit.wikimedia.org/r/698718 (https://phabricator.wikimedia.org/T284106) (owner: 10Marostegui) [14:08:28] !log Restart sanitarium hosts (db2094, db2095, db1154, db1155) to pick up new filters T284106 [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:32] T284106: Replicate & sanitize wikitech data - https://phabricator.wikimedia.org/T284106 [14:08:39] (03CR) 10Volans: [C: 03+1] "LGTM, should go after Ie6f30f0d911bfa13c23714eb54315906a01d8309" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (owner: 10Jbond) [14:08:58] (03PS1) 10Bstorm: dumps distribution: freemirror.org getting DNS fail [puppet] - 10https://gerrit.wikimedia.org/r/698793 [14:09:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:09:36] 10SRE, 10Traffic, 10HTTPS, 10Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (10hashar) [14:11:23] (03CR) 10Bstorm: [C: 03+2] dumps distribution: freemirror.org getting DNS fail [puppet] - 10https://gerrit.wikimedia.org/r/698793 (owner: 10Bstorm) [14:21:16] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10schoenbaechler) Hey @Volans — just tried out some dashboards on Superset, works as expected — thanks a lot! 👏 [14:22:21] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Volans) 05Stalled→03Resolved Great, resolving. 
[14:22:46] (03PS2) 10Muehlenhoff: Switch scandium/testreduce to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) [14:25:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:26:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:26:49] (03PS1) 10Marostegui: dbstore1007: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/698794 (https://phabricator.wikimedia.org/T283125) [14:27:49] (03CR) 10Marostegui: [C: 03+2] dbstore1007: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/698794 (https://phabricator.wikimedia.org/T283125) (owner: 10Marostegui) [14:30:22] (03CR) 10Volans: [C: 03+1] "LGTM, tests on netbox-next were successful, but we're testing it a bit more to be sure." 
[software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [14:34:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch scandium/testreduce to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:37:44] (03PS1) 10Effie Mouzeli: kubernetes::deployment_server: create a separate mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/698795 [14:39:11] (03CR) 10jerkins-bot: [V: 04-1] kubernetes::deployment_server: create a separate mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/698795 (owner: 10Effie Mouzeli) [14:43:38] !log cleanup now unused nginx mods and former deps (various X11 libs and libxslt) on testreduce1001/scandium after switch towards nginx-light T164456 [14:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:44] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [14:44:41] (03PS1) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [14:45:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10DNdubane_WMF) 05Open→03Resolved Thank you so much for such express service. I am now able to connect!! 
[14:45:08] (03PS2) 10Effie Mouzeli: kubernetes::deployment_server: create a separate mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/698795 [14:45:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29817/console" [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [14:45:43] (03PS2) 10Muehlenhoff: Enable profile::nginx for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/698509 (https://phabricator.wikimedia.org/T164456) [14:46:08] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [14:47:06] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation), 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:49:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:50:42] (03PS1) 10Muehlenhoff: role::docker_registry_ha::registry: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) [14:52:15] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:53:19] ^ there are two spikes in latency today; they both correlate to parsercache changes i made [14:53:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:54:32] effie: FYI ^ [14:55:58] (03PS1) 10Giuseppe Lavagetto: mediawiki: force curl to use the puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/698801 (https://phabricator.wikimedia.org/T284417) [14:56:20] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation), 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:56:33] kormat: anything actionable? [14:56:43] effie: nope. just gotta wait it out [14:56:58] basically both pc2 and pc3 are ~completely cold [14:57:05] ping me if you need anything [14:57:06] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: force curl to use the puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/698801 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [14:57:10] effie: ack, ty. [14:57:42] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Add the puppet CA to the MediaWiki deployment - https://phabricator.wikimedia.org/T284417 (10Joe) After further consideration, we came to the conclusion that the best course of action is: * For now, hotpatch the chart to use the puppet ca * Create a "wm... 
[14:58:16] <_joe_> uhm same error in ci heh [14:59:10] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp203[34].codfw.wmnet [14:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:48] (03PS1) 10Muehlenhoff: Switch docker registry to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698803 (https://phabricator.wikimedia.org/T164456) [15:04:27] !log powerdown cp2033 for relocation [15:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:39] PROBLEM - Host cp2033 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:11] (03PS2) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [15:08:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29818/console" [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:08:37] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:08:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698803 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:10:06] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/698546 (owner: 10DCausse) [15:11:30] (03PS3) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [15:12:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29819/console" [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:12:57] 
(03CR) 10jerkins-bot: [V: 04-1] P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:13:21] RECOVERY - Host cp2033 is UP: PING OK - Packet loss = 0%, RTA = 31.50 ms [15:13:37] !log powerdown cp2034 for relocation [15:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:41] (03CR) 10Klausman: [WIP] - Add the operators.d directory with basic Istio config (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [15:13:51] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:14:32] (03PS1) 10Jbond: O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) [15:15:07] (03PS17) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:15:09] PROBLEM - Host cp2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:43] (03PS2) 10Jbond: cas: add cas_configuration symlink [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) [15:16:22] (03PS4) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [15:17:00] (03PS2) 10Jbond: O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 
10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) [15:18:13] PROBLEM - Host cp2034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:18:33] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:19:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc2009.codfw.wmnet,pc1009.eqiad.wmnet with reason: Purging parsercache pc3 T282761 [15:19:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc2009.codfw.wmnet,pc1009.eqiad.wmnet with reason: Purging parsercache pc3 T282761 [15:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:09] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [15:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] (03PS5) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [15:21:02] 10SRE, 10Epic, 10Performance Issue, 10Release-Engineering-Team (Seen): [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394 (10hashar) [15:21:09] (03PS1) 10Ottomata: airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) [15:21:14] (03PS3) 10Jbond: O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) [15:21:20] 10SRE, 10Epic, 10Performance Issue, 10Release-Engineering-Team (Seen): [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394 (10hashar) [15:22:04] RECOVERY - Host 
cp2034 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [15:23:15] (03CR) 10jerkins-bot: [V: 04-1] airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:23:16] RECOVERY - Host cp2034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms [15:23:40] !log mwmaint1002: Running purge-parsercache-now.php on server 4/4 (pc1009) ref P16060, T280605, T282761. [15:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:45] T280605: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 [15:25:24] (03PS6) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [15:25:59] (03PS4) 10Jbond: O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) [15:26:51] (03PS2) 10DCausse: Add akhatun to analytics-search [puppet] - 10https://gerrit.wikimedia.org/r/698546 (https://phabricator.wikimedia.org/T284575) [15:26:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29823/console" [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:27:51] (03PS3) 10Ottomata: Add akhatun to analytics-search [puppet] - 10https://gerrit.wikimedia.org/r/698546 (https://phabricator.wikimedia.org/T284575) (owner: 10DCausse) [15:30:16] (03PS7) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [15:30:33] (03PS5) 10Jbond: O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/698807 
(https://phabricator.wikimedia.org/T244849) [15:30:36] (03CR) 10Ottomata: [C: 03+2] Add akhatun to analytics-search [puppet] - 10https://gerrit.wikimedia.org/r/698546 (https://phabricator.wikimedia.org/T284575) (owner: 10DCausse) [15:31:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29824/console" [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:32:22] (03PS1) 10Mforns: Migrate WMDEBanner* schemas to EventPlatform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698811 (https://phabricator.wikimedia.org/T282562) [15:33:56] !log powerdown thanos-fe2003 for relocation [15:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:20] marostegui: little favor to ask as an op here, could you try /invite jinxer-wm to this channel? thank you! [15:35:18] (03Abandoned) 10Jbond: cas: add cas_configuration symlink [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:35:34] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation), 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:36:42] PROBLEM - Host thanos-fe2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:24] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) @Dzahn While I have been able to access through Jupyter, I haven't been able to get the kerberos login. When I type in kinit it just hangs - Do I ne... 
[15:37:43] godog: done [15:39:08] (03Restored) 10Jbond: cas: add cas_configuration symlink [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:39:14] (03PS3) 10Jbond: cas: add cas_configuration symlink if the config exists in /etc/netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) [15:39:40] PROBLEM - Host thanos-fe2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:07] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp203[34].codfw.wmnet [15:40:09] marostegui: thank you! yeah the bot isn't identifying to nickserv and even on /invite won't join, thank you ! [15:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:24] 8/govol [15:40:31] * jbond ignore me [15:40:59] !log powerdown ms-be2061 for relocation [15:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:45] 7ignore jbond [15:41:57] (03PS2) 10Giuseppe Lavagetto: mediawiki: force curl to use the puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/698801 (https://phabricator.wikimedia.org/T284417) [15:42:11] I realized only italian-layout using folks will get the joke [15:42:26] (shift-7 is where / is) [15:42:27] (03CR) 10Volans: [C: 03+1] "small typo, LGTM" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:42:32] ahh :P [15:43:08] (03PS4) 10Jbond: cas: add cas_configuration symlink if the config exists in /etc/netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) [15:43:24] yeah needless to say I didn't use italian layout for very long after I started using linux [15:43:38] PROBLEM - Host ms-be2061 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:53] heh, same on german layout [15:44:06] unix 
paths made much more sense when I found out where it is on US layouts [15:44:37] yeah so much more convenient [15:45:16] RECOVERY - Alertmanager IRC relay is not connected on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [15:47:47] (03PS1) 10Giuseppe Lavagetto: Fix undefined variables in Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/698813 [15:47:50] PROBLEM - Host ms-be2061.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:51:35] RECOVERY - Host thanos-fe2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.61 ms [15:52:01] RECOVERY - Host thanos-fe2003 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [15:52:01] (03PS5) 10Jbond: cas: add cas_configuration symlink [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) [15:53:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: force curl to use the puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/698801 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [15:53:35] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fix." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:55:54] (03Merged) 10jenkins-bot: mediawiki: force curl to use the puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/698801 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [15:57:39] (03CR) 10Volans: [C: 03+1] "LGTM if the compiler is happy. Couple of nits inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:58:36] (03CR) 10Volans: [C: 03+1] "LGTM, would this also affect the cloud instance?" 
[puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [15:58:43] PROBLEM - Host db2100.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] jbond42 and cdanis: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1600). [16:01:55] (03PS8) 10Jbond: P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) [16:02:53] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:31] RECOVERY - Host ms-be2061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.70 ms [16:04:24] (03CR) 10Jbond: "thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [16:06:22] !log powerdown ms-backup2002 for relocation [16:06:25] RECOVERY - Host ms-be2061 is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [16:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:33] PROBLEM - Host ms-backup2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:08] (03CR) 10Jbond: [V: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [16:12:09] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Ownership of the /tmp/mw-cache directories should be www-data in the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284581 (10Joe) [16:13:11] PROBLEM - Host ms-backup2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:13:52] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Ownership of the /tmp/mw-cache directories should be www-data in the mediawiki-multiversion image 
- https://phabricator.wikimedia.org/T284581 (10Joe) I suspect we just need to remove the directories, as they only contain a cached configuration f... [16:17:22] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) >>! In T251918#7142275, @JMeybohm wrote: > Thanks @jbond for looking after this. I'll bluntly close this task again now. thanks and can confirm the lat... [16:18:51] RECOVERY - Host ms-backup2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.36 ms [16:20:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:21:17] 10SRE, 10SRE-Access-Requests: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (10Volans) @ssastry correct me if I'm wrong, but according to [[ https://www.mediawiki.org/wiki/Wikimedia_Maps/2021_modernization_plan#Who_will_be_leading_this_project? | Wikimedia_Maps/2021... 
[16:22:33] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [16:25:17] RECOVERY - Host ms-backup2002 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [16:25:39] (03PS6) 10Jbond: O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) [16:27:41] !log powerdown moss-fe2002 for relocation [16:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:47] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698818 [16:30:49] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698818 (owner: 10Jeena Huneidi) [16:32:21] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698818 (owner: 10Jeena Huneidi) [16:32:22] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.9 refs T281150 [16:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:26] T281150: 1.37.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T281150 [16:36:13] (03PS1) 10David Caro: ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) [16:36:13] PROBLEM - Host moss-fe2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:37:15] (03PS9) 10Jbond: IDM: create new idm library with logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) [16:38:11] (03CR) 10Jbond: IDM: create new idm library with logoutd base class (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 
(https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [16:38:17] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Ottomata) @JAnstee_WMF no you shouldn't. Where are you typing 'kinit'? Into an ssh terminal or into a Jupyter shell terminal? Also, how are you accessing Jup... [16:38:54] (03CR) 10jerkins-bot: [V: 04-1] ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [16:39:45] 10SRE, 10SRE-Access-Requests: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (10ssastry) Indeed. Thanks for flagging me. Approved. [16:42:00] RECOVERY - Host moss-fe2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.75 ms [16:42:04] (03CR) 10Volans: ceph: add cookbooks to reboot osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [16:44:47] RECOVERY - Host db2100.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [16:45:12] 10SRE, 10Data-Persistence-Backup, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) I've been able to find the source of the dropped traffic between eqiad and codfw. T... 
[16:50:43] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: The restricted/mediawiki-multiversion image should include the production version of private/PrivateSettings.php - https://phabricator.wikimedia.org/T284582 (10Joe) [16:50:50] (03CR) 10Razzi: [C: 03+1] Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones [dns] - 10https://gerrit.wikimedia.org/r/698729 (https://phabricator.wikimedia.org/T283125) (owner: 10Elukey) [16:53:08] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation), 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [16:53:34] 10SRE, 10Data-Persistence-Backup, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) I also researched / played with TCP tunings. I don't believe the current CUBIC algo... [16:53:58] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation), 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) 05Open→03Resolved Complete [16:54:00] (03CR) 10Elukey: [C: 03+2] Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones [dns] - 10https://gerrit.wikimedia.org/r/698729 (https://phabricator.wikimedia.org/T283125) (owner: 10Elukey) [16:56:45] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:30] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) 05Open→03Resolved CPU and main board replaced server is backup online [17:00:04] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1700). 
[17:00:06] (03PS1) 10Elukey: Fix the dbstore1007 IP after changing VLAN in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/698824 (https://phabricator.wikimedia.org/T283125) [17:00:18] XioNoX: --^ FYI [17:00:44] elukey: ? [17:00:59] ah right you don't have the code review notifications, I always forget [17:01:10] I am deploying https://gerrit.wikimedia.org/r/c/operations/homer/public/+/698824/1/templates/cr/firewall.conf if you are ok [17:01:37] (03CR) 10Ayounsi: [C: 03+1] Fix the dbstore1007 IP after changing VLAN in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/698824 (https://phabricator.wikimedia.org/T283125) (owner: 10Elukey) [17:01:39] yep [17:01:51] thanks :) [17:01:59] (03CR) 10Elukey: [C: 03+2] Fix the dbstore1007 IP after changing VLAN in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/698824 (https://phabricator.wikimedia.org/T283125) (owner: 10Elukey) [17:05:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:06:11] (03PS1) 10Jdlrobson: Set `.mw-echo-alert` class on link instead of list-item [extensions/Echo] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698680 (https://phabricator.wikimedia.org/T284496) [17:06:34] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.9 refs T281150 (duration: 34m 12s) [17:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:39] T281150: 1.37.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T281150 [17:08:47] 10SRE, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting 
access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10hashar) 05Resolved→03Open @Ladsgroup now has access to the server, but I completely missed he already needs adm... [17:08:53] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:10:36] !log fix dbstore1007's ip address in analytics-in4 on cr{1,2}-eqiad [17:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:27] (03CR) 10Volans: "I've quickly tested it and looks good to me, just one change needed and couple of optional nits. Conceptually +1" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [17:18:57] (03CR) 10Jgiannelos: [C: 03+2] Bump chromium-render to version 2021-05-24-102519-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/694413 (owner: 10Jgiannelos) [17:20:53] (03CR) 10Volans: "Couple of minor things inline, LGTM as a starting point." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [17:21:24] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Marostegui) Thanks Papaul - I have started mysql and it all went fine. I have stopped mysql again. Will leave it up to Jaime to decide if he wants to re... 
[17:21:25] (03PS1) 10Ahmon Dancy: Clean up cruft in /tmp/mw-cache-* before publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698828 (https://phabricator.wikimedia.org/T284581) [17:21:30] (03Merged) 10jenkins-bot: Bump chromium-render to version 2021-05-24-102519-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/694413 (owner: 10Jgiannelos) [17:24:48] 10SRE, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10Marostegui) 05Open→03Resolved Done! ` root@mwmaint1002:~# ldapsearch -x cn=ciadmin | grep lads member: uid=lads... [17:25:09] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:37] (03CR) 10Volans: [C: 03+1] "I think we got a final version! Time to write some tests ;)" (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [17:32:30] (03PS10) 10Jbond: IDM: create new idm library with logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) [17:35:11] (03CR) 10Jbond: IDM: create new idm library with logoutd base class (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [17:36:40] (03PS19) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) [17:36:45] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:03] (03CR) 10Jbond: profile::contacts: add a profile and define for adding contact metadata (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [17:37:06] (03PS1) 10Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 [17:38:53] (03CR) 10jerkins-bot: [V: 04-1] (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 (owner: 10Effie Mouzeli) [17:41:21] (03PS1) 10Bartosz Dziewoński: Update surface styles for VE changes [extensions/DiscussionTools] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698681 (https://phabricator.wikimedia.org/T284567) [17:42:13] (03CR) 10Bartosz Dziewoński: [C: 03+1] "Good to go whenever" [extensions/DiscussionTools] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698681 (https://phabricator.wikimedia.org/T284567) (owner: 10Bartosz Dziewoński) [17:48:29] (03PS3) 10Effie Mouzeli: kubernetes::deployment_server: create a separate mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/698795 [17:48:52] (03PS2) 10Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 [17:49:57] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:32] (03CR) 10jerkins-bot: [V: 04-1] (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 (owner: 10Effie Mouzeli) [17:51:07] (03PS3) 10MarcoAurelio: enwiki: Remove 'collectionsaveascommunitypage' from the 'user' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) [17:52:02] (03CR) 10MarcoAurelio: enwiki: Remove 'collectionsaveascommunitypage' from the 'user' group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) (owner: 10MarcoAurelio) [17:55:17] (03PS1) 10Ladsgroup: ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698682 (https://phabricator.wikimedia.org/T280491) [17:55:27] (03CR) 10Ladsgroup: [C: 03+2] ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698682 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [17:56:26] (03PS1) 10Ladsgroup: ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.8) - 10https://gerrit.wikimedia.org/r/698683 (https://phabricator.wikimedia.org/T280491) [17:57:04] (03CR) 10Ladsgroup: [C: 03+2] ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.8) - 10https://gerrit.wikimedia.org/r/698683 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [17:57:56] (03Abandoned) 10Ladsgroup: ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.8) - 10https://gerrit.wikimedia.org/r/698683 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [17:58:23] (03PS4) 10Effie Mouzeli: kubernetes::deployment_server: create a separate mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/698795 [17:58:32] 
(03PS1) 10Ladsgroup: ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698684 (https://phabricator.wikimedia.org/T280491) [17:58:47] (03PS3) 10Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 [17:58:57] (03CR) 10Ladsgroup: [C: 03+2] ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698684 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [17:59:15] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) @Ottomata Entered kinit in terminal. Accessed Jupyter hub via localhost:8880 [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1800) [18:00:35] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Ottomata) > Entered kinit in terminal Which terminal? In your browser in Jupyter or via ssh? Might be hard to troubleshoot this async, wanna ping me on IRC in #... [18:00:48] (03CR) 10jerkins-bot: [V: 04-1] (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 (owner: 10Effie Mouzeli) [18:02:20] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Ottomata) > Accessed Jupyter hub via localhost:8880 Which stat box? 
[18:04:42] (03PS4) 10Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 [18:05:32] (03PS4) 10MarcoAurelio: enwiki: Remove 'collectionsaveascommunitypage' from the 'user' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) [18:06:24] (03PS5) 10MarcoAurelio: enwiki: Remove 'collectionsaveascommunitypage' from the 'autoconfirmed' and 'confirmed' user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) [18:06:41] (03CR) 10jerkins-bot: [V: 04-1] (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 (owner: 10Effie Mouzeli) [18:10:05] RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [18:11:33] 10SRE: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (10MMiller_WMF) [18:12:05] (03CR) 10Jeena Huneidi: [C: 03+2] "Merging according to https://phabricator.wikimedia.org/T281150" [extensions/Echo] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698680 (https://phabricator.wikimedia.org/T284496) (owner: 10Jdlrobson) [18:13:58] (03PS1) 10Bstorm: dumps distribution: remove mirrors.freemirror.org [puppet] - 10https://gerrit.wikimedia.org/r/698836 [18:15:16] (03PS1) 10Jdlrobson: Fix MonoBook orange banner hover styles [extensions/Echo] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698848 (https://phabricator.wikimedia.org/T284496) [18:17:02] 10SRE, 10Wikimedia-Mailing-lists: mailman3 unsubscribe link not showing in daily article list e-mails - https://phabricator.wikimedia.org/T284548 (10Legoktm) @krd you should be able to edit the footer using the template at 
https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/template... [18:19:33] 10SRE: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (10RLazarus) My guess is, your SSH key always had a passphrase associated, but on your old laptop, the Mac OS keychain was storing it for you (that's the "UseKeychain yes") in your `.ssh/config`. I don't know much of anything about... [18:22:08] (03CR) 10Ahmon Dancy: [C: 03+2] "Tested manually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698828 (https://phabricator.wikimedia.org/T284581) (owner: 10Ahmon Dancy) [18:22:58] (03Merged) 10jenkins-bot: Clean up cruft in /tmp/mw-cache-* before publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698828 (https://phabricator.wikimedia.org/T284581) (owner: 10Ahmon Dancy) [18:23:57] 10SRE, 10serviceops, 10Patch-For-Review: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (10Majavah) [18:24:10] rzl: airdrop is just a way to copy files between Apple stuff [18:25:00] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Ownership of the /tmp/mw-cache directories should be www-data in the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284581 (10dancy) 05Open→03Resolved a:03dancy @Joe This should be fixed now. You... 
[18:25:06] 10SRE, 10MW-on-K8s, 10serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dancy) [18:26:19] (03Merged) 10jenkins-bot: ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698682 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [18:26:22] (03CR) 10jerkins-bot: [V: 04-1] ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698684 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [18:27:24] 10SRE: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (10RhinosF1) Hi, If you open an app called 'Keychain' on your device. (Icon is some keys). You'll see a list of passwords for things. Try searching in their to see if it's listed. It might be under id_rsa. Thanks! [18:28:05] (03CR) 10Ladsgroup: [C: 03+2] "." [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698684 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [18:29:55] 10ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) [18:30:03] 10SRE, 10ops-codfw: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) [18:31:19] 10SRE, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10jbond) 05Open→03Resolved a:05MoritzMuehlenhoff→03jbond This should be complete now please re-open if you still see issues, thanks [18:35:16] (03CR) 10Jeena Huneidi: [C: 03+2] Fix MonoBook orange banner hover styles [extensions/Echo] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698848 (https://phabricator.wikimedia.org/T284496) (owner: 10Jdlrobson) [18:36:21] (03Merged) 10jenkins-bot: Set `.mw-echo-alert` class on link instead 
of list-item [extensions/Echo] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698680 (https://phabricator.wikimedia.org/T284496) (owner: 10Jdlrobson) [18:43:42] !log apt: update gdnsd package to gdnsd-3.7.0-1~wmf1 [18:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:08] (03PS1) 10Ahmon Dancy: Use the production version of private/PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698837 (https://phabricator.wikimedia.org/T284582) [18:47:31] !log dns4001: update gdnsd to 3.7.0-1~wmf1 [18:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:01] (03Merged) 10jenkins-bot: ruby tests: update gems for newer watir version [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698684 (https://phabricator.wikimedia.org/T280491) (owner: 10Ladsgroup) [18:52:32] I need to rebase this on mwdeploy1002 but I can't find my yubikey :D [18:55:23] (03Merged) 10jenkins-bot: Fix MonoBook orange banner hover styles [extensions/Echo] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698848 (https://phabricator.wikimedia.org/T284496) (owner: 10Jdlrobson) [18:56:09] I got it Amir1 - if you want it back, send some waffels my way [18:56:25] found it [18:56:30] it was in keyholder [18:57:08] of course a key chain would be in the key holder and yet I was looking everywhere in the house [18:57:35] Understandable [18:57:38] :) [18:57:44] Happens to me all the time [18:58:30] You see I just forget to put them down so keys aren't in keybox [18:58:59] (03PS5) 10Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 [19:00:05] longma and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T1900). 
[19:00:33] (03CR) 10jerkins-bot: [V: 04-1] (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 (owner: 10Effie Mouzeli) [19:02:55] (03PS6) 10Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 [19:03:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) a:05Cmjohnson→03RobH @robh mw1423-1447 are ready for you now as well [19:04:38] (03CR) 10jerkins-bot: [V: 04-1] (WIP) add mcrouter pools to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/698829 (owner: 10Effie Mouzeli) [19:07:50] Train is blocked by https://phabricator.wikimedia.org/T284567, so no deployment until that's fixed [19:08:30] !log [WDQS] `ryankemper@wdqs1005:~$ sudo pool` (all caught up on lag) [19:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:14] I rebased the Echo patch on wmf.9 as well [19:16:18] jeena: ^ [19:17:25] Amir1: sorry, which patch? [19:17:27] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:47] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/698848 [19:18:22] ah yeah, I merged that already [19:18:37] !log T280382 `sudo systemctl stop wdqs-updater wdqs-blazegraph` on `wdqs1010` in preparation for transfer [19:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:41] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [19:19:41] jeena: but it wasn't rebased on mwdeploy1002 [19:19:48] Do you want me to sync it too? 
[19:19:49] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:56] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:36] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:44] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1009.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring skolemized wikidata.jnl so we can reimage wdqs1009" --blazegraph_instance blazegraph --without-lvs` on `ryankemper@cumin1001` tmux session `wdqs_1009` [19:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:48] oh, I see. So after talking to Jdlrobson, I understood it was okay to just deploy it to group0, but now I'm waiting on another issue, so I guess we could sync it [19:23:32] lmk if you want me to do it or you're already doing it [19:25:15] !log apt: update gdnsd package to gdnsd-3.7.0-2~wmf1 (fix systemd reload issues) [19:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:26:08] !log dns400[12]: update gdnsd to 3.7.0-3~wmf1 [19:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:53] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) I deduced from your question that one terminal location was right and the other one wrong and have now been able 
to authenticate in jupyter - thanks... [19:28:44] 10SRE, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10hashar) Thank you very much @Marostegui , much appreciated. [19:32:47] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/Echo/modules/nojs/mw.echo.alert.monobook.less: Backport: [[gerrit:698848|Fix MonoBook orange banner hover styles (T284496)]] (duration: 01m 08s) [19:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:52] T284496: Regression: Echo new talk page message banner has lost its orange background - https://phabricator.wikimedia.org/T284496 [19:33:37] jeena: synced ^ [19:34:11] thanks! I was just logging into mwdebug to check! [19:34:24] (03CR) 10Ahmon Dancy: [C: 03+2] "Verified manually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698837 (https://phabricator.wikimedia.org/T284582) (owner: 10Ahmon Dancy) [19:35:04] (03Merged) 10jenkins-bot: Use the production version of private/PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698837 (https://phabricator.wikimedia.org/T284582) (owner: 10Ahmon Dancy) [19:36:12] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: The restricted/mediawiki-multiversion image should include the production version of private/PrivateSettings.php - https://phabricator.wikimedia.org/T284582 (10dancy) 05Open→03Resolved a:03dancy @Joe This should be fi... 
[19:36:19] 10SRE, 10MW-on-K8s, 10serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dancy) [19:36:22] !log T280382 Cancelling the data-transfer run to restart it; realized that the cookbook will start up the `wdqs-updater` again so will locally hack the cookbook on `cumin1001` to prevent that [19:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:26] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [19:36:28] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [19:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:58] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) 05Open→03Resolved Tentatively closing since this sounds like issues are resolved. Feel free to reopen it if there is anything else missing. [19:40:20] Amir1: I'm planning to merge something to wmf.9 and proceed with the train now. That won't interfere with anything you are doing, right? [19:40:35] yes. I'm done basically [19:40:42] okay thanks [19:41:24] (03CR) 10Jeena Huneidi: [C: 03+2] Update surface styles for VE changes [extensions/DiscussionTools] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698681 (https://phabricator.wikimedia.org/T284567) (owner: 10Bartosz Dziewoński) [19:41:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:43:00] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:42] !log dns[1235]001: update gdnsd to 3.7.0-2~wmf1 [19:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:47:36] (03Merged) 10jenkins-bot: Update surface styles for VE changes [extensions/DiscussionTools] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/698681 (https://phabricator.wikimedia.org/T284567) (owner: 10Bartosz Dziewoński) [19:50:51] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Ottomata) :) [19:51:02] (03PS1) 10Jeena Huneidi: group0 wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698845 [19:51:04] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698845 (owner: 10Jeena Huneidi) [19:51:47] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698845 (owner: 10Jeena Huneidi) [19:52:45] 10SRE, 10SRE-Access-Requests: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (10Dzahn) [19:53:29] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.9 refs T281150 [19:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:33] T281150: 1.37.0-wmf.9 deployment blockers - 
https://phabricator.wikimedia.org/T281150 [19:55:46] !log dns[1235]002: update gdnsd to 3.7.0-2~wmf1 [19:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:50] !log authdns2001: update gdnsd to 3.7.0-2~wmf1 [20:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:07] !log authdns1001: update gdnsd to 3.7.0-2~wmf1 [20:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:12] SRE, Technical-blog-posts, Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (srodlund) @Legoktm Great! Thank you for sharing this! It should be pretty straightforward to publish this in the blog! I am going t... [20:42:44] SRE, Technical-blog-posts, Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (srodlund) >>! In T284486#7140701, @Ladsgroup wrote: > We should write a blog post about the upgrade in general too. Maybe later. @la... [20:42:58] (PS2) Ssingh: site: add doh4001 to role insetup and setup dhcp [puppet] - https://gerrit.wikimedia.org/r/698265 (https://phabricator.wikimedia.org/T284349) (owner: Cwhite) [20:45:20] (CR) Dzahn: [C: +1] site: add doh4001 to role insetup and setup dhcp [puppet] - https://gerrit.wikimedia.org/r/698265 (https://phabricator.wikimedia.org/T284349) (owner: Cwhite) [20:45:50] (CR) Ssingh: [C: +2] site: add doh4001 to role insetup and setup dhcp [puppet] - https://gerrit.wikimedia.org/r/698265 (https://phabricator.wikimedia.org/T284349) (owner: Cwhite) [21:03:36] (PS3) Ssingh: site: add wikidough eqiad with insetup role [puppet] - https://gerrit.wikimedia.org/r/698505 (https://phabricator.wikimedia.org/T284348) [21:07:20] (CR) ArielGlenn: [C: +1] "Bummer!"
[puppet] - https://gerrit.wikimedia.org/r/698836 (owner: Bstorm) [21:12:33] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:37] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:22] SRE, Traffic, vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (ssingh) >>! In T284349#7135796, @colewhite wrote: > Cookbook ran successfully. Currently unprovisioned. Thanks for creating the VM! [21:27:56] !log T280382 Disabled puppet on `wdqs1010` out of abundance of caution; will re-enable after wdqs1009 is reimaged and xfer back is complete [21:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:01] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:28:27] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is CRITICAL: type=tcp.address_unreachable https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [21:29:09] (CR) Dzahn: [C: +1] site: add wikidough eqiad with insetup role [puppet] - https://gerrit.wikimedia.org/r/698505 (https://phabricator.wikimedia.org/T284348) (owner: Ssingh) [21:29:11] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1009.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `wdqs_1009` [21:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:11] 👀 that's quite a NEL graph [21:33:21] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is CRITICAL: type=tcp.address_unreachable
https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [21:34:41] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [21:34:51] looks like the tcp.address_unreachable was mostly for upload.wm.o but not exclusively, and I can't pull out any patterns in country/isp/etc [21:35:16] (PS1) Ryan Kemper: Revert "temp limit GAE/GCE traffic towards search API" [puppet] - https://gerrit.wikimedia.org/r/698849 [21:41:38] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) [[ https://wikitech.wikimedia.org/w/index.php?title=Service_ports&type=revision&diff=1914806&oldid=1913236 | reserved port 4111 ]] as public TLS service po...
[21:41:53] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) [21:42:50] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1009.eqiad.wmnet with reason: REIMAGE [21:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:18] (PS1) Jforrester: admin: Add second SSH key for jforrester [puppet] - https://gerrit.wikimedia.org/r/698877 [21:44:53] (CR) RLazarus: [C: +1] "You probably want to add "Bug: T284479" to the commit message but otherwise looks good, deployment as discussed on IRC :)" [puppet] - https://gerrit.wikimedia.org/r/698849 (owner: Ryan Kemper) [21:44:56] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1009.eqiad.wmnet with reason: REIMAGE [21:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:53:04] (PS2) Ryan Kemper: Revert "temp limit GAE/GCE traffic towards search API" [puppet] - https://gerrit.wikimedia.org/r/698849 (https://phabricator.wikimedia.org/T284479) [21:57:47] (CR) Ryan Kemper: [C: +2] Revert "temp limit GAE/GCE traffic towards search API" [puppet] - https://gerrit.wikimedia.org/r/698849 (https://phabricator.wikimedia.org/T284479) (owner: Ryan Kemper) [21:58:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:59:43] !log T284479 Prior context: We put a block on a range of Google App Engine IPs yesterday to protect Cirrussearch from a bad actor; now we're going to try lifting the block and seeing if we're still getting slammed with traffic [21:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:48] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [22:01:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:01:44] !log T284479 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/698849, running puppet on `cp3052.esams.wmnet` [22:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:19] (PS2) Jforrester: admin: Add second SSH key for jforrester [puppet] - https://gerrit.wikimedia.org/r/698877 (https://phabricator.wikimedia.org/T284613) [22:03:49] !log T284479 Successful puppet run on `cp3052`, proceeding to rest of `A:cp-text`: `sudo cumin -b 15 'A:cp-text' 'run-puppet-agent -q'` [22:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:09:27] !log T284479 Puppet run complete across all of `cp-text`. Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=now-1h&to=now over the next few minutes to see if we see a large spike in `full_text` and `entity_full_text` queries [22:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:31] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [22:10:07] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:30] !log T284479 Already starting to see a large upward spike in requests.
Doing a quick sanity check to make sure this is out of the ordinary but I'll likely be putting the block back in place shortly [22:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:58] !log T284479 Yup more than enough evidence of a strong upward spike now. Proceeding to revert [22:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:35] (PS1) Ryan Kemper: Revert "Revert "temp limit GAE/GCE traffic towards search API"" [puppet] - https://gerrit.wikimedia.org/r/698850 [22:12:00] (CR) Jforrester: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) (owner: Jbond) [22:12:28] (PS2) Ryan Kemper: Revert "Revert "temp limit GAE/GCE traffic towards search API"" [puppet] - https://gerrit.wikimedia.org/r/698850 (https://phabricator.wikimedia.org/T284479) [22:13:15] (CR) Ryan Kemper: [C: +2] Revert "Revert "temp limit GAE/GCE traffic towards search API"" [puppet] - https://gerrit.wikimedia.org/r/698850 (https://phabricator.wikimedia.org/T284479) (owner: Ryan Kemper) [22:14:19] !log T284479 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/698850, running puppet on `cp3052.esams.wmnet` [22:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:47] !log T284479 Successful puppet run on `cp3052`, proceeding to rest of `A:cp-text`: `sudo cumin -b 19 'A:cp-text' 'run-puppet-agent -q'` [22:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:51] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [22:21:56] !log T284479 Block put back in place. We're back to expected traffic levels. We'll need a more granular mitigation in place before we can lift this block going forward.
[22:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:00] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [22:25:13] (PS1) Jforrester: docker-reporter: Ignore dropped image releng/node10-kartotherian [puppet] - https://gerrit.wikimedia.org/r/698883 [22:36:33] !log krinkle@deploy1002 Started deploy [integration/docroot@d4c9e08]: (no justification provided) [22:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:41] !log krinkle@deploy1002 Finished deploy [integration/docroot@d4c9e08]: (no justification provided) (duration: 00m 08s) [22:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:42] PROBLEM - Host cloudmetrics1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210608T2300). [23:00:05] samwilson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:02:30] I'm here [23:02:46] (PS3) Samwilson: Enable Wikisource OCR on select Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898) [23:13:45] RoanKattouw urbanecm: are one of you doing the deploy?
[23:23:41] ACKNOWLEDGEMENT - Host cloudmetrics1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% BryanDavis https://phabricator.wikimedia.org/T281881 [23:38:35] RECOVERY - Host cloudmetrics1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.00 ms [23:44:53] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (Jclark-ctr) I was able to update Firmware host is back up now