[00:23:20] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:25:28] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:51:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:52] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:51:22] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:53:34] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:28:01] (PS1) Marostegui: dbproxy2004: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/748569 (https://phabricator.wikimedia.org/T295965)
[06:30:33] (CR) Marostegui: [C: +2] dbproxy2004: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/748569 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[06:41:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2004.codfw.wmnet with OS bullseye
[06:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:54] PROBLEM - Check systemd state on cumin2001 is CRITICAL:
CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:40] (PS1) Marostegui: install_server: Allow reimage dbproxy2004 [puppet] - https://gerrit.wikimedia.org/r/748571 (https://phabricator.wikimedia.org/T295965)
[07:08:29] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy2004.codfw.wmnet with OS bullseye
[07:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:38] (CR) Marostegui: [C: +2] install_server: Allow reimage dbproxy2004 [puppet] - https://gerrit.wikimedia.org/r/748571 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[07:12:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2004.codfw.wmnet with OS bullseye
[07:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:40] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy2004.codfw.wmnet with OS bullseye
[07:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:12] SRE, MW-on-K8s, SRE Observability, serviceops, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe)
[08:14:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2015.codfw.wmnet with OS buster
[08:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:50] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS buster
[08:18:30] SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (Fokebox) @Aklapper my project is not supported by Wikimedia
Affiliate. Just wondered if I can use Wikimediamaps with Karographer extension/
[08:40:46] !log updated bullseye installer images for 11.2 point release
[08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2004.codfw.wmnet with OS bullseye
[08:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:40] SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (Peachey88) @Fokebox Have you read https://wikitech.wikimedia.org/wiki/Maps/External_usage and https://foundation.wikimedia.org/wiki/Maps_Terms_of_Use ?
[08:50:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2015.codfw.wmnet with OS buster
[08:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:06] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS buster completed: - ganeti2015 (**PASS**) - Downtimed on Icinga...
[09:03:01] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[09:03:08] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff) p:Triage→Medium
[09:03:11] (CR) JMeybohm: [C: +1] "LGTM (assuming no migration for the DB is needed in the current state of this)" [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747881 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[09:03:37] (CR) JMeybohm: [C: +1] Fix --clusters command line parsing and add tests [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[09:04:13] (CR) JMeybohm: [C: +2] hieradata: Empty kubernetes_cluster_groups on wmcs [puppet] - https://gerrit.wikimedia.org/r/748092 (https://phabricator.wikimedia.org/T297853) (owner: Majavah)
[09:06:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet
[09:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:48] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:13:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet
[09:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2004.codfw.wmnet with OS bullseye
[09:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:52] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff)
[09:28:43] !log
jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage
[09:28:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage
[09:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:33] !log switch kubetcd2006 to DRBD storage to allow eventual migration for reimage of ganeti2019
[09:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:57] !log Stop mysql on db2078:3325 to check new haproxy on bullseye T295965
[09:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:04] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[09:47:31] (PS11) JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385)
[09:49:44] (CR) JMeybohm: admin_ng: Create Certificates for ingressgateway (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: JMeybohm)
[09:51:45] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff)
[09:53:36] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: switch to drbd storage
[09:54:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: switch to drbd storage
[09:54:07] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:30] !log Stop mysql on db2135 to check new haproxy on bullseye T295965
[09:56:32] !log switch ml-etcd2001 to DRBD storage to allow eventual migration for reimage of ganeti2019
[09:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:35] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:26] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:03] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff)
[10:10:16] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:10:36] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:13] (PS1) Ladsgroup: passwords: Add ladsgroup to the cloud root [labs/private] - https://gerrit.wikimedia.org/r/748699
[10:17:45] (CR) Ladsgroup: "Hi, I added my key that I put in my wikitech. Is that correct or I need to set a dedicated key for it?"
[labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[10:22:41] (CR) Jelto: [V: +1] gitlab_runner: use config template for registering new runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:24:38] (PS1) Kormat: tox.ini: Drop cover, make unit run coverage. [software/wmfdb] - https://gerrit.wikimedia.org/r/748700
[10:26:27] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:26:46] (CR) Kormat: [C: +2] tox.ini: Drop cover, make unit run coverage. [software/wmfdb] - https://gerrit.wikimedia.org/r/748700 (owner: Kormat)
[10:29:41] (CR) Majavah: passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[10:31:44] (CR) Ladsgroup: passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[10:33:53] SRE, ops-codfw, DC-Ops, serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Joe) Yes sorry, I dropped the ball on this. We need the servers to be spread across rows as much as possible, so: - 5 servers per row in two rows - 4 servers per row...
[10:47:21] (CR) Arturo Borrero Gonzalez: [C: -1] passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[10:47:52] (CR) Majavah: passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[10:48:59] (CR) Arturo Borrero Gonzalez: [C: -1] passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[11:08:04] (PS1) Jelto: Rakefile: check only client helm version [deployment-charts] - https://gerrit.wikimedia.org/r/748701 (https://phabricator.wikimedia.org/T251305)
[11:24:28] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS buster
[11:24:32] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster
[11:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:31] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:29:18] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1025.eqiad.wmnet with OS buster
[11:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:22] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with er...
[11:36:09] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (Volans) >>! In T293909#7578332, @Cmjohnson wrote: > @Volans These servers will not install correctly, I noticed that these have embedded 1G nic car...
[11:37:25] (CR) Ladsgroup: passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[11:39:13] (PS1) JMeybohm: Add generic probes/metrics networkpolicy to cert-manager/cfssl [deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560)
[11:49:26] (CR) Giuseppe Lavagetto: profile::mediawiki::jobrunner: restrict firewall rules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/376024 (owner: Giuseppe Lavagetto)
[11:54:56] (PS3) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: refactor function to handle dead config [puppet] - https://gerrit.wikimedia.org/r/748116
[11:54:58] (PS3) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132
[11:55:11] (PS1) Btullis: Increase the threshold EventgateLoggingExternalLatency [alerts] - https://gerrit.wikimedia.org/r/748704 (https://phabricator.wikimedia.org/T294911)
[11:55:54] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132 (owner: Arturo Borrero Gonzalez)
[12:02:44] (PS4) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132
[12:03:32] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132 (owner: Arturo Borrero Gonzalez)
[12:03:50] (PS5) Arturo Borrero Gonzalez: sonofgridengine:
grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132
[12:09:23] (Abandoned) Marostegui: jynus,kormat.bashrc: Replace mysql.py with db-mysql [puppet] - https://gerrit.wikimedia.org/r/748064 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[12:10:45] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[12:11:19] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:12:39] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:14:18] (CR) Arturo Borrero Gonzalez: [C: +2] sonofgridengine: grid-configurator: refactor function to handle dead config [puppet] - https://gerrit.wikimedia.org/r/748116 (owner: Arturo Borrero Gonzalez)
[12:14:21] (CR) Jelto: "looks mostly good for me. I left some comments" [deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[12:14:35] (CR) Arturo Borrero Gonzalez: [C: +2] sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132 (owner: Arturo Borrero Gonzalez)
[12:15:40] (CR) Awight: "Seems to break puppet on wmflabs, where the new hiera variable is missing." [puppet] - https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: Jgiannelos)
[12:21:45] (PS8) D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024)
[12:22:20] (CR) jerkins-bot: [V: -1] Localisation updates from https://translatewiki.net.
[software/mailman-templates] - https://gerrit.wikimedia.org/r/748711 (owner: L10n-bot)
[12:24:34] (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/748711 (owner: L10n-bot)
[12:33:53] (CR) Arturo Borrero Gonzalez: [C: -1] passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[12:47:22] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff) Along with the reimages 2012 (row C) and 2016 got auto-promoted as master candidates, but the VIP used for RAPI access by cookbooks needs to be from row B. I've manually...
[13:00:38] (PS9) D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024)
[13:05:56] (PS2) JMeybohm: Add generic probes/metrics networkpolicy to cert-manager/cfssl [deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560)
[13:06:33] (CR) JMeybohm: "Good points, thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[13:16:38] (PS1) Ladsgroup: auto_schema: Rework upgrade_mysql a bit to reuse code [software] - https://gerrit.wikimedia.org/r/748720 (https://phabricator.wikimedia.org/T239814)
[13:20:20] (PS10) D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024)
[13:21:48] (CR) jerkins-bot: [V: -1] Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) (owner: D3r1ck01)
[13:24:19] (PS11) D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024)
[13:30:44] (PS1) Muehlenhoff: CAS: Update to 6.4.4.2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/748722
[13:33:30] !log fail over master in codfw to ganeti2021 T296622
[13:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:36] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[13:37:43] PROBLEM - ganeti-wconfd running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:37:45] (PS1) Ladsgroup: auto_schema: Refactor bash to make it a bit cleaner [software] - https://gerrit.wikimedia.org/r/748723 (https://phabricator.wikimedia.org/T288235)
[13:43:53] ^ ganeti2019 is expected, logspam after master failover
[13:45:50] (CR) Daniel Kinzler: [C: +1] "This looks reasonable, and I have confirmed locally that it does not explode."
[mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) (owner: D3r1ck01)
[13:47:20] SRE, observability, service-runner, serviceops-radar: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (akosiaris) >>! In T222795#7578389, @Ottomata wrote: > Oh, is that not related? Anyway, I'm not aware of any alerts on...
[13:49:16] SRE, observability, service-runner, serviceops-radar: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (akosiaris) Open→Resolved a:akosiaris The patch has been merged but there isn't much point in tracking the...
[13:49:23] PROBLEM - Ganeti memory on ganeti2022 is CRITICAL: CRIT Memory 97% used. Largest process: qemu-system-x86 (18149) = 12.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[13:51:16] ^ there's an hbal in progress which will fix this
[14:00:37] RECOVERY - Ganeti memory on ganeti2022 is OK: OK Memory 78% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[14:13:34] !log installing wireshark security updates on buster
[14:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:39] (PS1) Ladsgroup: auto_schema: Automatic detection of active dc [software] - https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235)
[14:21:18] (PS2) Ladsgroup: auto_schema: Automatic detection of active dc [software] - https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235)
[14:21:45] (CR) JMeybohm: [C: +1] Rakefile: check only client helm version [deployment-charts] - https://gerrit.wikimedia.org/r/748701 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[14:24:12] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[14:25:12] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[14:27:05] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (Ottomata) Approved!
[14:27:40] (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[14:28:26] (CR) JMeybohm: [C: +2] Add generic probes/metrics networkpolicy to cert-manager/cfssl [deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[14:32:14] (Merged) jenkins-bot: Add generic probes/metrics networkpolicy to cert-manager/cfssl [deployment-charts] - https://gerrit.wikimedia.org/r/748703 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[14:36:18] (PS1) Esanders: Enable reply tool by default on wikispecies [mediawiki-config] - https://gerrit.wikimedia.org/r/748727 (https://phabricator.wikimedia.org/T297535)
[14:37:36] (CR) Esanders: "To be deployed on or after 4th Jan 2022."
[mediawiki-config] - https://gerrit.wikimedia.org/r/748727 (https://phabricator.wikimedia.org/T297535) (owner: Esanders)
[14:39:35] (CR) Jelto: [C: +2] Rakefile: check only client helm version [deployment-charts] - https://gerrit.wikimedia.org/r/748701 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[14:39:37] !log bblack@cumin1001 START - Cookbook sre.dns.netbox
[14:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:14] (CR) Ottomata: [C: +1] Add event stream config for ios.notification_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920) (owner: Sharvaniharan)
[14:40:33] (CR) Ottomata: [C: +1] Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818) (owner: Sharvaniharan)
[14:42:53] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:03] (Merged) jenkins-bot: Rakefile: check only client helm version [deployment-charts] - https://gerrit.wikimedia.org/r/748701 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[14:44:26] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[14:48:01] (CR) Muehlenhoff: [C: +1] "Looks good.
For our internal servers we can also temporarily set profile::apt::mirror to 'deb.debian.org', which would minimise impact in " [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[14:52:17] (PS1) BBlack: drmrs: include Netbox files for LVS subnets [dns] - https://gerrit.wikimedia.org/r/748728 (https://phabricator.wikimedia.org/T282787)
[14:53:46] (CR) BBlack: [C: +2] drmrs: include Netbox files for LVS subnets [dns] - https://gerrit.wikimedia.org/r/748728 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[15:04:05] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[15:07:52] (CR) Jelto: [V: +1 C: +2] gitlab_runner: disable disk check for docker volumes [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[15:10:37] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[15:11:06] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[15:12:23] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[15:14:26] RECOVERY - Disk space on gitlab-runner1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=gitlab-runner1001&var-datasource=eqiad+prometheus/ops
[15:16:28] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:18:48] RECOVERY - Disk space on gitlab-runner2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=gitlab-runner2001&var-datasource=codfw+prometheus/ops
[15:19:34] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:25:16] (CR) Varac: "Hi!" [deployment-charts] - https://gerrit.wikimedia.org/r/742909 (owner: Varac)
[15:25:53] (CR) Herron: [C: -1] "Am I understanding correctly that this is meant to drop messages matching the from/to addresses in the filter? If so please see comments " [puppet] - https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T132324) (owner: Jcrespo)
[15:38:55] !log Deploy security patch for T298019
[15:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:55] (CR) Majavah: Kubernetes 1.22 support, update chart version (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/742909 (owner: Varac)
[15:42:47] (PS2) Varac: Kubernetes 1.22 support, update chart version [deployment-charts] - https://gerrit.wikimedia.org/r/742909
[15:42:49] (PS1) Varac: Also support older k8s versions <=1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/748734
[15:50:47] (CR) Varac: Kubernetes 1.22 support, update chart version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/742909 (owner: Varac)
[15:58:19] !log bblack@cumin1001 START - Cookbook sre.dns.netbox
[15:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:11] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:42] (PS1) JHathaway: profile::apt::mirror: change apt mirror to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/748740
[16:16:01] !log bblack@cumin1001 START - Cookbook sre.dns.netbox
[16:16:05] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:44] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:29] (PS1) Volans: sre.hosts.provision: disable internal USB ports [cookbooks] - https://gerrit.wikimedia.org/r/748741
[16:19:43] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/748740 (owner: JHathaway)
[16:21:40] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:26] (CR) JHathaway: [C: +2] profile::apt::mirror: change apt mirror to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/748740 (owner: JHathaway)
[16:27:46] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[16:28:54] (CR) Volans: [C: +2] sre.hosts.provision: disable internal USB ports [cookbooks] - https://gerrit.wikimedia.org/r/748741 (owner: Volans)
[16:32:07] (Merged) jenkins-bot: sre.hosts.provision: disable internal USB ports [cookbooks] - https://gerrit.wikimedia.org/r/748741 (owner: Volans)
[16:34:05] SRE, MediaWiki-Gerrit-Group-Requests: Grant Access to mediawiki gerrit group for divec - https://phabricator.wikimedia.org/T285931 (Majavah) Open→Declined Closing due to inactivity.
[16:36:08] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host mc2038.mgmt.codfw.wmnet with reboot policy GRACEFUL
[16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:46] (PS3) JHathaway: mirrors.wikimedia.org: point to new mirror [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898)
[16:43:21] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2038.mgmt.codfw.wmnet with reboot policy GRACEFUL
[16:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:46] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (bking)
[16:52:21] (PS5) Jcrespo: exim: [puppet] - https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T298038)
[16:54:14] (PS1) BBlack: drmrs: configure ats-tls params [puppet] - https://gerrit.wikimedia.org/r/748746 (https://phabricator.wikimedia.org/T282787)
[16:54:16] (PS1) BBlack: cloudgw: add newly-allocated drmrs IPs [puppet] - https://gerrit.wikimedia.org/r/748747 (https://phabricator.wikimedia.org/T282787)
[16:59:26] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:32] (CR) Majavah: [C: -1] "sodium is exempt from the cloud vps egress NAT (see profile::openstack::DEPLOYMENT::cloudgw::dmz_cidr hiera). Those should be either updat" [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[17:00:47] (PS6) Jcrespo: exim: [puppet] - https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T298038)
[17:04:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org [17:04:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2040.mgmt.codfw.wmnet with reboot policy FORCED [17:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:33] (03PS7) 10Jcrespo: exim: [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T298038) [17:14:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2040.mgmt.codfw.wmnet with reboot policy FORCED [17:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org [17:15:08] (03PS1) 10BBlack: drmrs: configure lvs and public IPs [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [17:16:04] (03CR) 10jerkins-bot: [V: 04-1] drmrs: configure lvs and public IPs [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [17:17:36] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:20:11] (03PS2) 10BBlack: drmrs: configure lvs and public IPs [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [17:20:33] mutante: hi - not sure if you're the right person for this type of problem but suddenly https://phab.wiki says it's 404 - not configured? [17:21:37] hauskatze: I wasn't aware we had that domain. But the IP it points to now is not ours. [17:21:56] WMF stopped renewing all those .wiki domains I think [17:22:04] so someone else got that [17:22:08] oh, it worked last week I think [17:22:22] w.wiki is ours iirc [17:22:54] https://dnslytics.com/domain/phab.wiki [17:22:57] hmm, cloudflare [17:23:02] (03PS1) 10BBlack: drmrs: add to global datacenter list [puppet] - 10https://gerrit.wikimedia.org/r/748757 (https://phabricator.wikimedia.org/T282787) [17:23:08] (03CR) 10JMeybohm: "I'll need to update go.mod and vendor when the merged on github are done, but now I think that's not absolutely required to happen before " [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [17:23:40] hauskatze: yea, so originally we had all the domains like "en.wiki" and every language code.. but we did not end up using them. 
so they all expired except we kept w.wiki for the URL shortener ticket [17:24:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:24:11] hauskatze: ask Mukunda too if he was aware of the change [17:25:02] (03CR) 10BBlack: [C: 03+2] drmrs: configure ats-tls params [puppet] - 10https://gerrit.wikimedia.org/r/748746 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [17:25:10] (03CR) 10BBlack: [C: 03+2] cloudgw: add newly-allocated drmrs IPs [puppet] - 10https://gerrit.wikimedia.org/r/748747 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [17:25:57] jhathaway: you have an unmerged puppet patch about mirrors outstanding, safe to merge it with my stuff, or? [17:26:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:26:28] bblack: yes, sorry about that, I keep forgetting [17:26:32] mutante: ack, shall I file a Task? [17:26:35] ok, will merge, thanks! [17:26:48] hauskatze: if you think that domain had actual users.. yes [17:26:54] AFAIK phab.wiki is still used on mailing lists archives, etc. [17:27:02] though I don't think we'll sue it back [17:27:08] but ianal [17:28:25] there is the "domains" phab tag that may or not work to get attention from the person handling them at legal [17:28:45] or I can klaxon people :P [17:28:55] (03PS8) 10Jcrespo: exim: [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T298038) [17:29:01] joking of course [17:30:10] (03PS1) 10Volans: sre.hosts.provision: refactor to be more flexible [cookbooks] - 10https://gerrit.wikimedia.org/r/748761 [17:30:19] hauskatze: can you really? 
AFAIK it shouldn't let you to do the klaxoning :)) [17:30:26] (cf. https://wikitech.wikimedia.org/wiki/Klaxon#Who_is_allowed_to_send_pages_using_Klaxon?) [17:30:52] (03PS6) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 [17:30:58] urbanecm: nope, that was a double joke [17:30:59] hauskatze: Updated Date: 2021-07-25T22:33:00.0Z [17:31:11] okay then :) [17:31:25] hauskatze: if that worked last week.. the new owner was nice but changed stuff recently? [17:31:27] (03PS9) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [17:31:36] (03PS5) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857 [17:33:02] whois -h whois.cloudflare.com phab.wiki but it's all redacted [17:34:40] .wiki is the gift that will never stop giving :P [17:42:37] 10SRE, 10Domains: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10MarcoAurelio) [17:43:53] mutante: apparently we don't have anything in our files https://codesearch.wmcloud.org/search/?q=phab%5C.wiki&i=nope&files=&excludeFiles=&repos= [17:44:17] but I've not been paying attention to the git logs for a while so maybe it was removed [17:45:21] (03PS1) 10Esanders: Enable reply tool by default on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748765 (https://phabricator.wikimedia.org/T297533) [17:45:41] bblack: omg, gift.wiki is not even taken. quick! 
[17:45:49] (03CR) 10Esanders: [C: 04-1] "Timing TBD" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748765 (https://phabricator.wikimedia.org/T297533) (owner: 10Esanders) [17:46:06] (03CR) 10Esanders: [C: 04-1] Enable reply tool by default on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748727 (https://phabricator.wikimedia.org/T297535) (owner: 10Esanders) [17:46:08] I suspect someone individually bought it as a convenience shortcut and then let it expire [17:46:10] hauskatze: ok, ack, or caching in various places [17:46:26] !log disabling puppet on mail servers T298038 [17:46:27] good theory, legoktm [17:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:36] sounds likely [17:47:03] IIRC the first year of .wiki on namecheap was relatively cheap and then when it's renewal time it got like 3x more expensive [17:47:17] so yea.. make ticket but dunno how to find out who owns it [17:47:26] Just for clarity, I don't know exactly what happened. It used to work, and now it does not so I thought I should drop a note :) [17:47:28] cloudflare won't tell us [17:47:40] or I dunno.. maybe they would [17:47:42] the ICANN price bait-and-switch is so annoying [17:47:45] depending who asks [17:47:46] You could ask [17:47:52] "The People of the State of California to [...] greetings" [17:47:57] They might be nice if someone official asks [17:48:03] ask on wikitech-l maybe and see if the owner comes forward? 
[17:48:06] (03CR) 10Jcrespo: [C: 03+2] exim: [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T298038) (owner: 10Jcrespo) [17:48:26] yea, that's good, mailing list +1 [17:48:46] (03CR) 10Jcrespo: [C: 03+2] "Herron supervised this, trying to deploy now carefully (puppet disabled on mx hosts right now)" [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T298038) (owner: 10Jcrespo) [17:49:15] searching on phab does not output any meaningful result [17:49:33] unless this happened in a private Space [17:49:42] usually things about $$$ happens over there [17:49:45] e.g. procurement [17:50:04] https://archive.org/search.php?query=https%3A%2F%2Fphab.wiki [17:50:11] The search engine encountered the following error: invalid or no response from Elasticsearch [17:50:31] InternetArchive also working on patching log4j or something ? [17:50:52] anyways, just wanted to link to see when it was created or if it ever had content besides redirect [17:51:34] https://web.archive.org/web/*/https://phab.wiki [17:51:36] this one [17:51:53] but it hasn't archived it ever, so that was that [17:53:40] (03PS1) 10Andrew Bogott: wmcs-cinder-backup-manager.py: get openstack creds from novaadmin.yaml [puppet] - 10https://gerrit.wikimedia.org/r/748768 (https://phabricator.wikimedia.org/T294429) [17:55:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup-manager.py: get openstack creds from novaadmin.yaml [puppet] - 10https://gerrit.wikimedia.org/r/748768 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [17:55:31] (03CR) 10Giuseppe Lavagetto: "Overall LGTM - I suggested an improvement to the rewrite rules but you can ignore it; do use https instead of the env variable though, for" [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [17:55:43] searching across phab only pulls up https://phabricator.wikimedia.org/T245472 [17:57:28] (03PS3) 
10BBlack: drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [17:57:30] (03PS2) 10BBlack: drmrs: add to global datacenter list [puppet] - 10https://gerrit.wikimedia.org/r/748757 (https://phabricator.wikimedia.org/T282787) [17:57:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: Upgrade Calico to v3.21.0 [puppet] - 10https://gerrit.wikimedia.org/r/738179 (https://phabricator.wikimedia.org/T292698) (owner: 10Majavah) [17:58:06] !log reloading exim configuration with extra rule on mx2001 T298038 [17:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:20] oh I forgot to add 20after4 to the Task [18:00:41] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: drop APT repositories NAT exception [puppet] - 10https://gerrit.wikimedia.org/r/748771 (https://phabricator.wikimedia.org/T298042) [18:04:57] 10SRE, 10Domains, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10MarcoAurelio) [18:06:56] PROBLEM - Host ganeti2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:11:27] (03PS1) 10Arturo Borrero Gonzalez: cloud: drop APT repositories NAT exception [homer/public] - 10https://gerrit.wikimedia.org/r/748774 (https://phabricator.wikimedia.org/T298042) [18:12:23] (03PS1) 10Majavah: Add drmrs addresses [homer/public] - 10https://gerrit.wikimedia.org/r/748775 (https://phabricator.wikimedia.org/T282787) [18:12:25] (03CR) 10Arturo Borrero Gonzalez: "related change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/748771" [homer/public] - 10https://gerrit.wikimedia.org/r/748774 (https://phabricator.wikimedia.org/T298042) (owner: 10Arturo Borrero Gonzalez) [18:12:32] (03CR) 10Arturo Borrero Gonzalez: "related change: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/748774" [puppet] - 10https://gerrit.wikimedia.org/r/748771 (https://phabricator.wikimedia.org/T298042) 
(owner: 10Arturo Borrero Gonzalez) [18:13:12] RECOVERY - Host ganeti2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [18:18:42] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:19:04] (03PS1) 10Arturo Borrero Gonzalez: wmcs: don't print to stdout security groups when ensuring [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/748777 [18:19:43] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0]: Add anomaly detection queries [analytics/refinery@e29c9f0] [18:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:58] (03PS1) 10Esanders: Enable reply tool by default on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748780 (https://phabricator.wikimedia.org/T297534) [18:20:02] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:23] (03CR) 10Esanders: [C: 04-1] "To be deployed on or after 4th Jan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748780 (https://phabricator.wikimedia.org/T297534) (owner: 10Esanders) [18:21:10] (03CR) 10Legoktm: passwords: Add ladsgroup to the cloud root (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/748699 (owner: 10Ladsgroup) [18:25:26] (03CR) 10Ppchelko: [C: 04-1] api-gateway: allow discovery services to set custom rate limits (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [18:28:06] (03PS1) 10AOkoth: changeprop: increase memory limit for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/748781 (https://phabricator.wikimedia.org/T293729) [18:29:06] (03PS2) 10Arturo Borrero Gonzalez: wmcs: don't print to stdout security groups when ensuring [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/748777 
[18:30:05] (03CR) 10Dzahn: [C: 03+1] "speaking as an SRE with cloud root, it was occasionally helpful if you wanted to contribute or fix puppet classes used by other people on " [labs/private] - 10https://gerrit.wikimedia.org/r/748699 (owner: 10Ladsgroup) [18:30:53] (03CR) 10AOkoth: "Hi," [deployment-charts] - 10https://gerrit.wikimedia.org/r/748781 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [18:33:29] 10SRE, 10Domains, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) ` phab.wiki has address 104.21.94.13 phab.wiki has address 172.67.218.54 phab.wiki has IPv6 address 2606:4700:3033::6815:5e0d phab.wiki has IP... [18:37:22] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:37:23] 10SRE, 10Domains, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) Hi @MarcoAurelio thanks for the report! I wasn't aware of this domain name and after searching in some places (my own email, Phabricator, D... 
[18:38:06] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:38:53] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [18:38:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [18:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:46] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:41:48] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:41:58] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:42:32] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:42:51] !log mforns@deploy1002 Finished deploy [analytics/refinery@e29c9f0]: Add anomaly detection queries [analytics/refinery@e29c9f0] (duration: 23m 07s) [18:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:17] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0] (thin): Add anomaly detection queries THIN [analytics/refinery@e29c9f0] [18:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:25] !log mforns@deploy1002 Finished deploy [analytics/refinery@e29c9f0] (thin): Add anomaly detection queries THIN [analytics/refinery@e29c9f0] (duration: 00m 07s) [18:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:53] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] [18:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:30] 10SRE, 10Domains, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) {F34890931} [18:44:50] (03CR) 10Ayounsi: [C: 03+1] cloud: drop APT repositories NAT exception [homer/public] - 10https://gerrit.wikimedia.org/r/748774 (https://phabricator.wikimedia.org/T298042) (owner: 10Arturo Borrero Gonzalez) [18:44:54] (03PS1) 10Jcrespo: Revert "exim:" [puppet] - 10https://gerrit.wikimedia.org/r/748279 [18:45:18] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) [18:46:08] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] [18:46:10] (03CR) 10Jcrespo: [C: 03+2] 
Revert "exim:" [puppet] - 10https://gerrit.wikimedia.org/r/748279 (owner: 10Jcrespo) [18:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:21] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10MarcoAurelio) @Dzahn Searching through my inbox, the first occurence of phab.wiki is from 2019, so it has been around for a short wh... [18:46:57] (03CR) 10Ayounsi: "Thanks, 1 mistake, LGTM otherwise." [homer/public] - 10https://gerrit.wikimedia.org/r/748775 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [18:47:27] !log reenabling puppet on mx servers T298038 [18:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:06] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) Let me try to find the original old list of all the .wiki domains WMF once had. Will get back to you. [18:49:25] mutante or legoktm: If an e-mail is to be sent, I'd try first in ops-l. [18:56:19] (03PS2) 10Majavah: Add drmrs addresses [homer/public] - 10https://gerrit.wikimedia.org/r/748775 (https://phabricator.wikimedia.org/T282787) [18:56:43] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) @MarcoAurelio I still have an email from 2014 (!sic) that talks about .wiki domains and has a list that WMF was going to get.... [18:56:47] (03CR) 10Majavah: Add drmrs addresses (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/748775 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [18:58:08] hauskatze: I can see a list as old as 2014 and a domain renewal as new as January 2021 for .wiki domains owned by WMF. 
just .."phab" is not on either of the lists [18:58:53] mutante: I am not sure how to prove I am not lying that that domain existed :-) [18:59:27] hauskatze: I very much believe you it existed, I even attached a screenshot how it pops up in Google [18:59:37] hauskatze: it's just me trying to prove to you it wasn't under our control [18:59:41] that it stopped working [18:59:43] ah, missed that [18:59:53] * hauskatze should refresh more often [19:00:12] (03PS1) 10Legoktm: Expand $wgLocalVirtualHosts, update documentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748786 [19:00:14] (03PS1) 10Legoktm: Enable $wgLocalHTTPProxy on Kubernetes, regardless of group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748787 (https://phabricator.wikimedia.org/T288848) [19:00:25] WMF does still pay money for a bunch .wiki domains [19:00:34] just this one is not among them [19:00:54] and if they are renewed yearly and since it was paid last January [19:01:20] then it should also not be a special case that expired mid year.. even if it _was_ on the list, which it isn't [19:03:04] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:05:25] hauskatze: could also be that Phacility is the owner and redirected to us as the largest public Phab install [19:05:29] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Aklapper) I could neither find nor remember any trace of the domain `phab.wiki` either here... [19:06:02] (03CR) 10Jdlrobson: [C: 04-1] Deploy sticky header to pilot wikis, launch A/B test. 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [19:08:52] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) But I can confirm it does pop up in Google search results. maybe it's possible Phacility owns/owned that and pointed to us as... [19:12:17] mutante: hauskatze: I think I found the operator of phab.wiki: https://wikitech.wikimedia.org/wiki/User:Revi/phab.wiki [19:13:18] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Zabe) Using the global search tool I found https://wikitech.wikimedia.org/wiki/User:Revi/phab.wiki. [19:14:26] (03PS4) 10BBlack: drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [19:14:28] (03PS3) 10BBlack: drmrs: add to global datacenter list [puppet] - 10https://gerrit.wikimedia.org/r/748757 (https://phabricator.wikimedia.org/T282787) [19:15:31] (03CR) 10RLazarus: [C: 03+2] Add a pod_name column to ActiveContainerImage (031 comment) [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/747881 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [19:16:09] zabe: heh, I learned from revi that the domain existed, but I didn't know he was the owner? [19:16:20] zabe: very good! 
thank you :) [19:17:35] yw :) [19:17:45] (03Merged) 10jenkins-bot: Add a pod_name column to ActiveContainerImage [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/747881 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [19:18:34] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] [19:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:43] !log mforns@deploy1002 Finished deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] (duration: 00m 09s) [19:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:52] (03CR) 10RLazarus: [C: 03+2] Fix --clusters command line parsing and add tests [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [19:19:19] 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10JMinor) 05Open→03Resolved Ok, I think we finally nailed this one down. Thanks all. [19:20:35] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) Great detective work @Zabe, Thanks! 
added @revi Hi @Revi :) [19:20:50] (03PS5) 10BBlack: drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [19:20:52] (03PS4) 10BBlack: drmrs: add to global datacenter list [puppet] - 10https://gerrit.wikimedia.org/r/748757 (https://phabricator.wikimedia.org/T282787) [19:20:54] (03PS1) 10BBlack: drmrs: ncredir puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748790 [19:20:56] (03Merged) 10jenkins-bot: Fix --clusters command line parsing and add tests [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [19:21:15] (03PS2) 10BBlack: drmrs: ncredir puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748790 (https://phabricator.wikimedia.org/T282787) [19:22:40] (03CR) 10jerkins-bot: [V: 04-1] drmrs: ncredir puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748790 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [19:23:27] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) https://wikitech.wikimedia.org/wiki/User_talk:Revi/phab.wiki [19:24:55] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] [19:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:00] !log mforns@deploy1002 Finished deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] (duration: 00m 05s) [19:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:39] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/748790 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [19:26:13] !log mforns@deploy1002 Started deploy [analytics/refinery@e29c9f0] 
(hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] [19:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:17] !log mforns@deploy1002 Finished deploy [analytics/refinery@e29c9f0] (hadoop-test): Add anomaly detection queries TEST [analytics/refinery@e29c9f0] (duration: 00m 04s) [19:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:24] 10SRE, 10Domains, 10Phabricator, 10Traffic: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Sunpriat2) @revi timestamps: https://web.archive.org/web/*/https://phab.wiki/* 1 https://web.archive.org/web/2019*/https://phab.wiki... [19:26:29] (03CR) 10Legoktm: [C: 03+2] "This will only affect group0 wikis and the Kubernetes deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748786 (owner: 10Legoktm) [19:26:38] (03CR) 10Legoktm: [C: 03+2] "Kubernetes-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748787 (https://phabricator.wikimedia.org/T288848) (owner: 10Legoktm) [19:27:32] (03Merged) 10jenkins-bot: Expand $wgLocalVirtualHosts, update documentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748786 (owner: 10Legoktm) [19:27:34] (03Merged) 10jenkins-bot: Enable $wgLocalHTTPProxy on Kubernetes, regardless of group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748787 (https://phabricator.wikimedia.org/T288848) (owner: 10Legoktm) [19:28:52] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:28:55] I'm going to do the appserver syncs after Kubernetes auto-deploys [19:30:26] (03PS1) 10RLazarus: Release v0.0.3 [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748791 [19:30:43] !log otto@deploy1002 Started deploy [analytics/refinery@e29c9f0] (hadoop-test): (no justification provided) 
[19:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:05] !log otto@deploy1002 Finished deploy [analytics/refinery@e29c9f0] (hadoop-test): (no justification provided) (duration: 00m 22s) [19:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 84 probes of 641 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:32:33] !log otto@deploy1002 Started deploy [analytics/refinery@e29c9f0] (hadoop-test): (no justification provided) [19:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:16] (03CR) 10RLazarus: [C: 03+2] Release v0.0.3 [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748791 (owner: 10RLazarus) [19:33:50] 10SRE, 10Domains, 10Phabricator: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) [19:34:16] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) [19:34:53] (03Merged) 10jenkins-bot: Release v0.0.3 [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748791 (owner: 10RLazarus) [19:37:40] !log otto@deploy1002 Finished deploy [analytics/refinery@e29c9f0] (hadoop-test): (no justification provided) (duration: 05m 07s) [19:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 49 probes of 641 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:42:38] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 32.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:44:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:46] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Expand $wgLocalVirtualHosts, enable $wgLocalHTTPProxy on Kubernetes (duration: 00m 57s) [19:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:07] (03CR) 10JMeybohm: [C: 03+1] changeprop: increase memory limit for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/748781 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [19:48:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:07] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), and 2 others: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) With the above two patches, I was able to successfully load cross-wiki notifications, aka make cross-wiki... 
[20:02:37] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/python3-imagecatalog/imagecatalog_0.0.3-1_amd64.changes [20:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:16] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:04:00] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) [20:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:12] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) (duration: 00m 11s) [20:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:08] !log otto@deploy1002 Started deploy [airflow-dags/analytics@febf1c5]: (no justification provided) [20:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:14] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@febf1c5]: (no justification provided) (duration: 00m 06s) [20:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:18] !log otto@deploy1002 Started deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) [20:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:49] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) (duration: 00m 30s) [20:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:36] !log otto@deploy1002 Started deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) [20:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:43] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@febf1c5] 
(hadoop-test): (no justification provided) (duration: 00m 07s) [20:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:14] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) [20:12:17] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@febf1c5] (hadoop-test): (no justification provided) (duration: 00m 03s) [20:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:44] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 97.96 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:25:27] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Zabe) [20:25:30] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:25:44] (03CR) 10RLazarus: [C: 03+2] imagecatalog: Pass cluster names along with config paths [puppet] - 10https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [20:29:56] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:30:22] (03PS1) 10RLazarus: imagecatalog: Puppet spelling correction, s/str/String/ [puppet] - 10https://gerrit.wikimedia.org/r/748799 (https://phabricator.wikimedia.org/T287130) [20:33:46] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - 
https://phabricator.wikimedia.org/T297814 (10wiki_willy) a:03Cmjohnson [20:39:16] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Reedy) [20:51:27] (03PS1) 10RLazarus: Revert "imagecatalog: Pass cluster names along with config paths" [puppet] - 10https://gerrit.wikimedia.org/r/748280 [20:53:12] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33057/console" [puppet] - 10https://gerrit.wikimedia.org/r/748280 (owner: 10RLazarus) [20:53:29] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Revert "imagecatalog: Pass cluster names along with config paths" [puppet] - 10https://gerrit.wikimedia.org/r/748280 (owner: 10RLazarus) [20:59:43] (03PS1) 10Ottomata: Add analytics-test-hive connection for airflow-analytics-test instance [puppet] - 10https://gerrit.wikimedia.org/r/748804 (https://phabricator.wikimedia.org/T295201) [21:00:40] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33058/console" [puppet] - 10https://gerrit.wikimedia.org/r/748804 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [21:04:05] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Add analytics-test-hive connection for airflow-analytics-test instance [puppet] - 10https://gerrit.wikimedia.org/r/748804 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [21:21:00] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 21.95 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:21:17] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (10JoKalliauer) [21:23:30] PROBLEM - SSH on 
rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:04] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:47:12] !log deleting the 'exported' rackspace container in IAD -- pretty sure this is left over from wikitech-static DC migration [21:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:51] (03PS1) 10Herron: exim: update system filter [puppet] - 10https://gerrit.wikimedia.org/r/748820 [21:52:27] (03CR) 10jenkins-bot: [V: 04-1] exim: update system filter [puppet] - 10https://gerrit.wikimedia.org/r/748820 (owner: 10Herron) [21:52:56] (03PS2) 10Herron: exim: update system filter [puppet] - 10https://gerrit.wikimedia.org/r/748820 (https://phabricator.wikimedia.org/T298038) [22:17:37] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [22:18:24] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff complete [22:18:47] 10SRE, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): wikitech-static down - https://phabricator.wikimedia.org/T295266 (10Andrew) i created a tentative (and private) procurement ticket about this issue, here: T298052 [22:19:47] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) 05Resolved→03Open Thanks Papaul for the swift turnarounds, much appreciated. 
[22:39:25] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.97`. Pre-deploy tests passing on canary `wdqs1003` [22:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:05] (Side note: there's no deploys scheduled this week but we need to get out a `logback` version bump wrt all the recent CVEs, thus the deploy) [22:41:30] !log bking@deploy1002 Started deploy [wdqs/wdqs@81ee634]: 0.3.97 [22:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:50] (03CR) 10Dzahn: wdqs: switch GUI deployment from latest to present (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [22:48:01] Tests passed on canary, proceeding with wdqs deploy [22:48:20] !log [WDQS Deploy] Tests passing following deploy of `0.3.97` on canary `wdqs1003`; proceeding to rest of fleet [22:48:41] !log [WDQS Deploy] Tests passing following deploy of `0.3.97` on canary `wdqs1003`; proceeding to rest of fleet [22:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:08] (03CR) 10Dzahn: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [22:50:53] !log bking@deploy1002 Finished deploy [wdqs/wdqs@81ee634]: 0.3.97 (duration: 09m 22s) [22:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:28] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 5.566e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:03:40] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [23:03:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:59] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [23:04:27] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [23:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:32] !log [WDQS] `ryankemper@wdqs1006:~$ sudo depool` (catching up on ~14.5 hours of lag) [23:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:43] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [23:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:28] !log bking@deploy1002 Started deploy [wdqs/wdqs@81ee634] (wcqs): Deploy 0.3.97 to WCQS [23:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:11] (03CR) 10Thcipriani: [C: 03+1] admin: add approver for the "restricted" group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747463 (owner: 10MVernon) [23:25:52] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:26:15] !log bking@deploy1002 Finished deploy [wdqs/wdqs@81ee634] (wcqs): Deploy 0.3.97 to WCQS (duration: 02m 46s) [23:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:51] !log [WCQS Deploy] Deploy complete of version `0.3.97` [23:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:24] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.07e+07 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:56:14] (03PS2) 10RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] - 10https://gerrit.wikimedia.org/r/748799 (https://phabricator.wikimedia.org/T287130) [23:57:37] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33059/console" [puppet] - 10https://gerrit.wikimedia.org/r/748799 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [23:57:56] 10SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (10BilalShirwani) [23:58:40] 10SRE-swift-storage, 10Observability-Metrics, 10serviceops: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 (10BilalShirwani) [23:59:27] 10SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10BilalShirwani)