[00:10:33] (PS1) Urbanecm: testwiki: Enable conditional defaults for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/994857 (https://phabricator.wikimedia.org/T353225)
[00:10:50] (CR) Urbanecm: [C: +2] testwiki: Enable conditional defaults for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/994857 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[00:11:36] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/994857 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[00:11:51] (Merged) jenkins-bot: testwiki: Enable conditional defaults for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/994857 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[00:12:13] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:994857|testwiki: Enable conditional defaults for 4 Echo properties (T353225)]]
[00:12:32] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225
[00:13:45] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:994857|testwiki: Enable conditional defaults for 4 Echo properties (T353225)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:13:50] !log urbanecm@deploy2002 urbanecm: Continuing with sync
[00:20:22] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:994857|testwiki: Enable conditional defaults for 4 Echo properties (T353225)]] (duration: 08m 09s)
[00:20:42] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225
[00:39:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/994775
[00:39:11] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/994775 (owner: TrainBranchBot)
[00:42:14] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:24] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:48] (PS1) Cwhite: beta-logs: configure no benthos instances [puppet] - https://gerrit.wikimedia.org/r/994776 (https://phabricator.wikimedia.org/T355836)
[00:52:33] (CR) Cwhite: [C: +2] beta-logs: configure no benthos instances [puppet] - https://gerrit.wikimedia.org/r/994776 (https://phabricator.wikimedia.org/T355836) (owner: Cwhite)
[01:00:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/994775 (owner: TrainBranchBot)
[01:12:17] (PS1) Dzahn: add DMARC record for gitlab.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/994864 (https://phabricator.wikimedia.org/T355776)
[01:31:06] SRE, Bitu, Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (bd808)
[01:31:13] SRE, Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (bd808)
[02:08:47] (PS1) Daimona Eaytoy: beta: Configure Fluxx endpoint for WikimediaCampaignEvents [mediawiki-config] - https://gerrit.wikimedia.org/r/994866 (https://phabricator.wikimedia.org/T347894)
[02:13:20] (PS1) Daimona Eaytoy: private/readme.php: Add stubs for WikimediaCampaignEvents [mediawiki-config] - https://gerrit.wikimedia.org/r/994867 (https://phabricator.wikimedia.org/T347894)
[02:13:56] (PS2) Daimona Eaytoy: beta: Configure Fluxx endpoint for WikimediaCampaignEvents [mediawiki-config] - https://gerrit.wikimedia.org/r/994866 (https://phabricator.wikimedia.org/T347894)
[02:13:59] (PS2) Daimona Eaytoy: private/readme.php: Add stubs for WikimediaCampaignEvents [mediawiki-config] - https://gerrit.wikimedia.org/r/994867 (https://phabricator.wikimedia.org/T347894)
[02:14:13] (PS2) Daimona Eaytoy: Update commonsettings-labs to enable WikimediaCampaignEvents extension on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/994179 (https://phabricator.wikimedia.org/T347894) (owner: Mhorsey)
[02:39:27] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:27] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:16:58] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:30] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:52] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:24] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (netboxdb1002, ...), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:33:48] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:20] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:56] (CR) Marostegui: [C: +2] db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/994611 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[05:52:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T356235
[05:52:19] T356235: Switchover es5 codfw master es2024 -> es2023 - https://phabricator.wikimedia.org/T356235
[05:52:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T356235
[05:52:40] (Merged) jenkins-bot: db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/994611 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[05:52:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2023 with weight 0 T356235', diff saved to https://phabricator.wikimedia.org/P56011 and previous config saved to /var/cache/conftool/dbconfig/20240201-055240-marostegui.json
[05:53:54] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:994611|db-production.php: Disable writes on es5 (T356235)]]
[05:55:33] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:994611|db-production.php: Disable writes on es5 (T356235)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:55:40] !log marostegui@deploy2002 marostegui: Continuing with sync
[05:57:10] (CR) Marostegui: mariadb: Switchover es5 master [puppet] - https://gerrit.wikimedia.org/r/994731 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[05:57:12] (CR) Marostegui: [C: +2] mariadb: Switchover es5 master [puppet] - https://gerrit.wikimedia.org/r/994731 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[05:58:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2099.codfw.wmnet with reason: Maintenance
[05:59:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2099.codfw.wmnet with reason: Maintenance
[06:00:41] (PS1) Marostegui: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/994720
[06:02:13] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:994611|db-production.php: Disable writes on es5 (T356235)]] (duration: 08m 19s)
[06:02:22] T356235: Switchover es5 codfw master es2024 -> es2023 - https://phabricator.wikimedia.org/T356235
[06:07:58] !log Starting es4 codfw failover from es2024 to es2023 - T356235
[06:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:02] T356235: Switchover es5 codfw master es2024 -> es2023 - https://phabricator.wikimedia.org/T356235
[06:08:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2023 to es5 primary T356235', diff saved to https://phabricator.wikimedia.org/P56012 and previous config saved to /var/cache/conftool/dbconfig/20240201-060853-root.json
[06:09:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:10:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 T356235', diff saved to https://phabricator.wikimedia.org/P56013 and previous config saved to /var/cache/conftool/dbconfig/20240201-061041-root.json
[06:10:56] (CR) Marostegui: wmnet: Update CNAME for es5 [dns] - https://gerrit.wikimedia.org/r/994730 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[06:11:00] (PS2) Marostegui: wmnet: Update CNAME for es5 [dns] - https://gerrit.wikimedia.org/r/994730 (https://phabricator.wikimedia.org/T356235)
[06:12:37] (CR) Marostegui: [C: +2] wmnet: Update CNAME for es5 [dns] - https://gerrit.wikimedia.org/r/994730 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[06:12:50] (CR) Marostegui: [C: +2] Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/994720 (owner: Marostegui)
[06:13:33] (Merged) jenkins-bot: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/994720 (owner: Marostegui)
[06:13:58] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:994720|Revert "db-production.php: Disable writes on es5"]]
[06:14:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 1%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56014 and previous config saved to /var/cache/conftool/dbconfig/20240201-061449-root.json
[06:15:28] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:994720|Revert "db-production.php: Disable writes on es5"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:15:46] !log marostegui@deploy2002 marostegui: Continuing with sync
[06:16:10] (PS1) Marostegui: mariadb: Promote pc2014 to pc1 master [puppet] - https://gerrit.wikimedia.org/r/994878 (https://phabricator.wikimedia.org/T356068)
[06:17:36] (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/994879 (https://phabricator.wikimedia.org/T356068)
[06:21:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2106.codfw.wmnet with reason: Maintenance
[06:21:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2106.codfw.wmnet with reason: Maintenance
[06:21:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2106 (T355609)', diff saved to https://phabricator.wikimedia.org/P56015 and previous config saved to /var/cache/conftool/dbconfig/20240201-062128-marostegui.json
[06:21:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Primary switchover pc1 T356068
[06:21:33] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:21:36] T356068: Switchover pc1 master pc2011 -> pc2014 - https://phabricator.wikimedia.org/T356068
[06:21:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Primary switchover pc1 T356068
[06:22:09] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:994720|Revert "db-production.php: Disable writes on es5"]] (duration: 08m 10s)
[06:22:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:22:33] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (Marostegui)
[06:24:35] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/994879 (https://phabricator.wikimedia.org/T356068) (owner: Marostegui)
[06:25:22] (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/994879 (https://phabricator.wikimedia.org/T356068) (owner: Marostegui)
[06:25:51] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:994879|ProductionServices.php: Promote pc2014 to pc1 master (T356068)]]
[06:27:19] (CR) Marostegui: [C: +2] mariadb: Promote pc2014 to pc1 master [puppet] - https://gerrit.wikimedia.org/r/994878 (https://phabricator.wikimedia.org/T356068) (owner: Marostegui)
[06:27:27] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:994879|ProductionServices.php: Promote pc2014 to pc1 master (T356068)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:27:30] T356068: Switchover pc1 master pc2011 -> pc2014 - https://phabricator.wikimedia.org/T356068
[06:28:10] !log marostegui@deploy2002 marostegui: Continuing with sync
[06:29:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56016 and previous config saved to /var/cache/conftool/dbconfig/20240201-062955-root.json
[06:34:45] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:994879|ProductionServices.php: Promote pc2014 to pc1 master (T356068)]] (duration: 08m 53s)
[06:34:48] T356068: Switchover pc1 master pc2011 -> pc2014 - https://phabricator.wikimedia.org/T356068
[06:36:20] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui)
[06:37:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56017 and previous config saved to /var/cache/conftool/dbconfig/20240201-064500-root.json
[06:45:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T355609)', diff saved to https://phabricator.wikimedia.org/P56018 and previous config saved to /var/cache/conftool/dbconfig/20240201-064511-marostegui.json
[06:45:16] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:47:30] SRE, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (Marostegui)
[06:48:37] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui) pc2011 is no longer a master, this can be done anytime as the host isn't used.
[06:49:54] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (Marostegui) es2024 is no longer a master.
[06:52:42] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/994794 (https://phabricator.wikimedia.org/T356305) (owner: Volans)
[06:53:07] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/994791 (owner: Ssingh)
[06:59:45] (PS1) Gerrit maintenance bot: mariadb: Promote db2107 to s2 master [puppet] - https://gerrit.wikimedia.org/r/994777 (https://phabricator.wikimedia.org/T356374)
[06:59:49] (PS1) Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/994778 (https://phabricator.wikimedia.org/T356374)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T0700)
[07:00:05] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T0700). nyaa~
[07:00:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56019 and previous config saved to /var/cache/conftool/dbconfig/20240201-070005-root.json
[07:00:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P56020 and previous config saved to /var/cache/conftool/dbconfig/20240201-070018-marostegui.json
[07:00:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s2 T356374
[07:00:48] T356374: Switchover s2 master (db2104 -> db2107) - https://phabricator.wikimedia.org/T356374
[07:00:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2107 with weight 0 T356374', diff saved to https://phabricator.wikimedia.org/P56021 and previous config saved to /var/cache/conftool/dbconfig/20240201-070057-marostegui.json
[07:01:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s2 T356374
[07:01:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[07:01:25] (CR) Marostegui: [C: +2] mariadb: Promote db2107 to s2 master [puppet] - https://gerrit.wikimedia.org/r/994777 (https://phabricator.wikimedia.org/T356374) (owner: Gerrit maintenance bot)
[07:01:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet
[07:01:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[07:02:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet
[07:08:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet
[07:08:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet
[07:12:30] (PS1) Muehlenhoff: Failover debmonitor to debmonitor1003 [dns] - https://gerrit.wikimedia.org/r/994881 (https://phabricator.wikimedia.org/T241049)
[07:13:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[07:13:38] SRE, Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (ops-monitoring-bot) Draining ganeti-test2002.codfw.wmnet of running VMs
[07:14:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[07:15:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56022 and previous config saved to /var/cache/conftool/dbconfig/20240201-071510-root.json
[07:15:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P56023 and previous config saved to /var/cache/conftool/dbconfig/20240201-071524-marostegui.json
[07:17:53] !log Starting s2 codfw failover from db2104 to db2107 - T356374
[07:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:57] T356374: Switchover s2 master (db2104 -> db2107) - https://phabricator.wikimedia.org/T356374
[07:18:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T356374', diff saved to https://phabricator.wikimedia.org/P56024 and previous config saved to /var/cache/conftool/dbconfig/20240201-071807-marostegui.json
[07:18:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2107 to s2 primary and set section read-write T356374', diff saved to https://phabricator.wikimedia.org/P56025 and previous config saved to /var/cache/conftool/dbconfig/20240201-071831-marostegui.json
[07:18:53] (CR) Marostegui: [C: +2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/994778 (https://phabricator.wikimedia.org/T356374) (owner: Gerrit maintenance bot)
[07:19:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104 T356374', diff saved to https://phabricator.wikimedia.org/P56026 and previous config saved to /var/cache/conftool/dbconfig/20240201-071934-root.json
[07:22:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[07:22:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet
[07:24:42] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui) db2104 is no longer a master
[07:24:54] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui)
[07:25:05] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui)
[07:26:19] PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:27:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P56027 and previous config saved to /var/cache/conftool/dbconfig/20240201-072713-root.json
[07:28:24] (PS1) Slyngshede: debmonitorr: switch over to new Bookworm hosts. [dns] - https://gerrit.wikimedia.org/r/994884
[07:28:39] (PS2) Slyngshede: debmonitor: switch over to new Bookworm hosts. [dns] - https://gerrit.wikimedia.org/r/994884
[07:29:29] (CR) Muehlenhoff: debmonitor: switch over to new Bookworm hosts. (1 comment) [dns] - https://gerrit.wikimedia.org/r/994884 (owner: Slyngshede)
[07:30:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56028 and previous config saved to /var/cache/conftool/dbconfig/20240201-073015-root.json
[07:30:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T355609)', diff saved to https://phabricator.wikimedia.org/P56029 and previous config saved to /var/cache/conftool/dbconfig/20240201-073031-marostegui.json
[07:30:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[07:30:35] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:30:36] !log installing openjdk-11 security updates
[07:30:38] (PS3) Slyngshede: debmonitor: switch over to new Bookworm hosts. [dns] - https://gerrit.wikimedia.org/r/994884
[07:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:45] (CR) Slyngshede: debmonitor: switch over to new Bookworm hosts. (1 comment) [dns] - https://gerrit.wikimedia.org/r/994884 (owner: Slyngshede)
[07:30:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[07:30:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2110 (T355609)', diff saved to https://phabricator.wikimedia.org/P56030 and previous config saved to /var/cache/conftool/dbconfig/20240201-073053-marostegui.json
[07:32:00] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/994884 (owner: Slyngshede)
[07:32:07] (CR) Slyngshede: [C: +2] debmonitor: switch over to new Bookworm hosts. [dns] - https://gerrit.wikimedia.org/r/994884 (owner: Slyngshede)
[07:32:25] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:33:06] (CR) Majavah: [C: +2] admin: remove non-yubikey key from taavi [puppet] - https://gerrit.wikimedia.org/r/994796 (owner: Majavah)
[07:33:37] !log Failover debmonitor to new Bookworm host
[07:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P56031 and previous config saved to /var/cache/conftool/dbconfig/20240201-074218-root.json
[07:45:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: After switchover ', diff saved to https://phabricator.wikimedia.org/P56032 and previous config saved to /var/cache/conftool/dbconfig/20240201-074520-root.json
[07:55:07] (CR) Brouberol: [C: +1] "Looks good! Thanks for the detailed commit message, that was helpful" [puppet] - https://gerrit.wikimedia.org/r/994812 (owner: Volans)
[07:55:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T355609)', diff saved to https://phabricator.wikimedia.org/P56033 and previous config saved to /var/cache/conftool/dbconfig/20240201-075545-marostegui.json
[07:55:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:57:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P56034 and previous config saved to /var/cache/conftool/dbconfig/20240201-075723-root.json
[08:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:41] (PS1) Hoo man: Add wgVirtualDomainsMapping for Cognate [mediawiki-config] - https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526)
[08:03:20] Amir1: Do you have a second for a quick look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/994922 ?
[08:03:53] If there are no objections, I'd like to SWAT this out in this window
[08:06:22] (CR) Brouberol: wdqs.data_transfer: refactor spicerack class api (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[08:10:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P56035 and previous config saved to /var/cache/conftool/dbconfig/20240201-081051-marostegui.json
[08:12:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P56036 and previous config saved to /var/cache/conftool/dbconfig/20240201-081228-root.json
[08:13:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet
[08:13:51] SRE, Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (ops-monitoring-bot) Draining ganeti-test2003.codfw.wmnet of running VMs
[08:15:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[08:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:21:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet
[08:22:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet
[08:25:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P56038 and previous config saved to /var/cache/conftool/dbconfig/20240201-082558-marostegui.json
[08:26:54] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[08:27:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P56039 and previous config saved to /var/cache/conftool/dbconfig/20240201-082733-root.json
[08:33:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[08:40:28] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet
[08:41:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T355609)', diff saved to https://phabricator.wikimedia.org/P56040 and previous config saved to /var/cache/conftool/dbconfig/20240201-084104-marostegui.json
[08:41:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2119.codfw.wmnet with reason: Maintenance
[08:41:10] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:41:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2119.codfw.wmnet with reason: Maintenance
[08:41:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2119 (T355609)', diff saved to https://phabricator.wikimedia.org/P56041 and previous config saved to /var/cache/conftool/dbconfig/20240201-084126-marostegui.json
[08:42:06] !log Restarting Gerrit replica on gerrit2002
[08:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P56042 and previous config saved to /var/cache/conftool/dbconfig/20240201-084238-root.json
[08:44:25] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet
[08:45:35] SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (ABran-WMF) SSH key has been confirmed out of band, @SLopes-WMF and @thcipriani have the remaining blocker in their hand
[08:45:47] SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (ABran-WMF)
[08:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:52:08] !log Restarted primary Gerrit on gerrit1003
[08:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:26] (CR) DCausse: [C: +1] "no objections to try this out but it's possible that the jvm is going to use all the additional mem that it's given, if that happens we ca" [deployment-charts] - https://gerrit.wikimedia.org/r/994197 (https://phabricator.wikimedia.org/T266216) (owner: Alexandros Kosiaris)
[08:57:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P56043 and previous config saved to /var/cache/conftool/dbconfig/20240201-085743-root.json
[09:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T0900)
[09:00:39] that will happen later tonight, Ahmon is running the train
[09:06:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T355609)', diff saved to https://phabricator.wikimedia.org/P56044 and previous config saved to /var/cache/conftool/dbconfig/20240201-090607-marostegui.json
[09:06:13] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:06:21] SRE, Wikimedia-Etherpad: Etherpad need restore to previous revision - https://phabricator.wikimedia.org/T356376 (SCP-2000)
[09:06:57] (PS2) Volans: firewall: fix nftables metric exporter [puppet] - https://gerrit.wikimedia.org/r/994794 (https://phabricator.wikimedia.org/T356305)
[09:07:05] (CR) Volans: firewall: fix nftables metric exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/994794 (https://phabricator.wikimedia.org/T356305) (owner: Volans)
[09:08:34] !log vgutierrez@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief2002.codfw.wmnet
[09:12:24] !log vgutierrez@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2002.codfw.wmnet
[09:12:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:13:44] (PS1) Filippo Giunchedi: oauth2-proxy: add ca-certificates [docker-images/production-images] - https://gerrit.wikimedia.org/r/994986 (https://phabricator.wikimedia.org/T320555)
[09:14:49] !log vgutierrez@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet
[09:15:36] (CR) Volans: [C: +2] firewall: fix nftables metric exporter [puppet] - https://gerrit.wikimedia.org/r/994794 (https://phabricator.wikimedia.org/T356305) (owner: Volans)
[09:17:28] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:18:20] (CR) Volans: [C: +2] "Thanks for the quick review!"
[puppet] - 10https://gerrit.wikimedia.org/r/994812 (owner: 10Volans) [09:18:25] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Etherpad need restore to previous revision - https://phabricator.wikimedia.org/T356376 (10LSobanski) [09:18:27] (03PS2) 10Volans: requestctl-generator: fix bug for URI filters [puppet] - 10https://gerrit.wikimedia.org/r/994812 [09:18:38] !log vgutierrez@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet [09:18:58] vgutierrez: keyholder complaining FYI ^^^ [09:19:33] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) Manually configuring IPv6 is straightforward as well once we know a couple points : When enabling forwarding on an interface (for example... [09:19:47] volans: timing issues.. I armed it a few minutes ago already [09:19:52] thx for the ping :) [09:20:15] !log vgutierrez@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief1002.eqiad.wmnet [09:20:16] then sorry for the unnecessary ping, too bad AM didn't get it in time [09:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P56045 and previous config saved to /var/cache/conftool/dbconfig/20240201-092115-marostegui.json [09:22:51] the alert fired at :12 and resolved at :17 volans FWIW [09:24:29] !log vgutierrez@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1002.eqiad.wmnet [09:25:01] (03PS2) 10Volans: requestctl-generator: adapt for superset 3 API [puppet] - 10https://gerrit.wikimedia.org/r/994811 (https://phabricator.wikimedia.org/T335356) [09:25:18] !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: JRE update for DSA 5604 - klausman@cumin2002 [09:26:28] !log vgutierrez@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [09:29:23] (03PS1) 10Muehlenhoff: Switch old debmonitor nodes to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/994987 (https://phabricator.wikimedia.org/T241049) [09:29:40] (03CR) 10Volans: [C: 03+1] "ready to be merged once superset is upgraded" [puppet] - 10https://gerrit.wikimedia.org/r/994811 (https://phabricator.wikimedia.org/T335356) (owner: 10Volans) [09:30:19] !log vgutierrez@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [09:30:53] (03CR) 10Muehlenhoff: [C: 03+2] Switch old debmonitor nodes to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/994987 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [09:32:28] (03CR) 10Btullis: [C: 03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/994811 (https://phabricator.wikimedia.org/T335356) (owner: 10Volans) [09:33:04] a [09:33:26] (03PS1) 10Arnaudb: admin: add arinaigum to users [puppet] - 10https://gerrit.wikimedia.org/r/995006 (https://phabricator.wikimedia.org/T355591) [09:33:57] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10ABran-WMF) [09:36:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P56046 and previous config saved to /var/cache/conftool/dbconfig/20240201-093621-marostegui.json [09:41:45] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:43:01] !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: JRE 
update for DSA 5604 - klausman@cumin2002 [09:43:20] !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: JRE update for DSA 5604 - klausman@cumin2002 [09:44:36] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add jaeger config for SSO oidc [puppet] - 10https://gerrit.wikimedia.org/r/994664 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [09:44:59] moritzm: merged your patch too [09:49:00] !log joal@deploy2002 Started deploy [airflow-dags/analytics@6b84b7a]: (no justification provided) [09:49:28] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@6b84b7a]: (no justification provided) (duration: 00m 28s) [09:51:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T355609)', diff saved to https://phabricator.wikimedia.org/P56047 and previous config saved to /var/cache/conftool/dbconfig/20240201-095128-marostegui.json [09:51:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2136.codfw.wmnet with reason: Maintenance [09:51:36] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:51:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2136.codfw.wmnet with reason: Maintenance [09:51:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T355609)', diff saved to https://phabricator.wikimedia.org/P56048 and previous config saved to /var/cache/conftool/dbconfig/20240201-095150-marostegui.json [10:01:03] !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: JRE update for DSA 5604 - klausman@cumin2002 [10:01:48] godog: ah, thanks. 
got distracted [10:07:49] (03PS1) 10Muehlenhoff: debmonitor: Move Hiera entries formerly specific to the new nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/994990 (https://phabricator.wikimedia.org/T241049) [10:08:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:09:08] !log btullis@deploy2002 Started deploy [analytics/superset/deploy@26c0d49]: (no justification provided) [10:10:07] !log btullis@deploy2002 Finished deploy [analytics/superset/deploy@26c0d49]: (no justification provided) (duration: 00m 59s) [10:11:41] !log Restarting CI Jenkins on contint2002 [10:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:21] (03CR) 10Jelto: [V: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) (owner: 10Hashar) [10:17:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T355609)', diff saved to https://phabricator.wikimedia.org/P56049 and previous config saved to /var/cache/conftool/dbconfig/20240201-101733-marostegui.json [10:17:37] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:19:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1256/console" [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) (owner: 10Hashar) [10:20:25] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm, let me know when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) (owner: 10Hashar) [10:24:19] (03PS1) 10Arnaudb: admin: remove goransm production access [puppet] - 10https://gerrit.wikimedia.org/r/995007 (https://phabricator.wikimedia.org/T356279) [10:26:01] (03CR) 10CI reject: [V: 04-1] admin: remove
goransm production access [puppet] - 10https://gerrit.wikimedia.org/r/995007 (https://phabricator.wikimedia.org/T356279) (owner: 10Arnaudb) [10:27:12] (03PS1) 10Btullis: Enable the presto nested data feature by default on superset [puppet] - 10https://gerrit.wikimedia.org/r/994995 (https://phabricator.wikimedia.org/T335356) [10:28:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] oauth2-proxy: add ca-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994986 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:28:31] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/994995 (https://phabricator.wikimedia.org/T335356) (owner: 10Btullis) [10:30:45] (03CR) 10Muehlenhoff: [C: 04-1] "See the ticket for recent updates" [puppet] - 10https://gerrit.wikimedia.org/r/995007 (https://phabricator.wikimedia.org/T356279) (owner: 10Arnaudb) [10:32:06] !log installing openjdk-11 security updates [10:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:36] (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable the presto nested data feature by default on superset [puppet] - 10https://gerrit.wikimedia.org/r/994995 (https://phabricator.wikimedia.org/T335356) (owner: 10Btullis) [10:32:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P56051 and previous config saved to /var/cache/conftool/dbconfig/20240201-103239-marostegui.json [10:38:11] (03PS1) 10Ayounsi: Routed Ganeti: enable IPv6 forwarding [puppet] - 10https://gerrit.wikimedia.org/r/994997 (https://phabricator.wikimedia.org/T300152) [10:39:01] !log phuedx@deploy2002 Started deploy [analytics/refinery@0d8e976]: analytics/refinery: Remove trvwikisource from scoop list [10:39:06] (03PS1) 10Slyngshede: 
P:debmonitor::server enable django/uwsgi logging. [puppet] - 10https://gerrit.wikimedia.org/r/994998 [10:39:20] (03CR) 10CI reject: [V: 04-1] Routed Ganeti: enable IPv6 forwarding [puppet] - 10https://gerrit.wikimedia.org/r/994997 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:40:45] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: httpbb needs to be setup on cumin1002 and removed from cumin1001 - https://phabricator.wikimedia.org/T356054 (10Clement_Goubert) a:03Scott_French [10:43:24] (03CR) 10CI reject: [V: 04-1] P:debmonitor::server enable django/uwsgi logging. [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [10:47:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P56052 and previous config saved to /var/cache/conftool/dbconfig/20240201-104746-marostegui.json [10:49:21] !log phuedx@deploy2002 Finished deploy [analytics/refinery@0d8e976]: analytics/refinery: Remove trvwikisource from scoop list (duration: 10m 20s) [10:50:38] !log phuedx@deploy2002 Started deploy [analytics/refinery@0d8e976] (thin): Remove trvwikisource from scoop list [10:50:43] !log phuedx@deploy2002 Finished deploy [analytics/refinery@0d8e976] (thin): Remove trvwikisource from scoop list (duration: 00m 05s) [10:51:05] (03Abandoned) 10Arnaudb: admin: remove goransm production access [puppet] - 10https://gerrit.wikimedia.org/r/995007 (https://phabricator.wikimedia.org/T356279) (owner: 10Arnaudb) [10:51:26] !log phuedx@deploy2002 Started deploy [analytics/refinery@0d8e976] (hadoop-test): Remove trvwikisource from scoop list [10:51:40] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10ABran-WMF) p:05High→03Medium [10:51:53] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/994990 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [10:52:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::hadoop::yarn [10:52:39] (03PS1) 10Andrea Denisse: grafana: Enable stunnel for Loki data transfer [puppet] - 10https://gerrit.wikimedia.org/r/994999 (https://phabricator.wikimedia.org/T352665) [10:53:35] (03PS2) 10Slyngshede: P:debmonitor::server enable django/uwsgi logging. [puppet] - 10https://gerrit.wikimedia.org/r/994998 [10:54:18] (03PS1) 10Muehlenhoff: Switch hadoop/yarn to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/995000 (https://phabricator.wikimedia.org/T349619) [10:54:56] !log phuedx@deploy2002 Finished deploy [analytics/refinery@0d8e976] (hadoop-test): Remove trvwikisource from scoop list (duration: 03m 30s) [10:58:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch hadoop/yarn to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/995000 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1100). 
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1100) [11:02:43] (03CR) 10Hnowlan: [C: 03+1] sessionstore: remove EOL hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994830 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [11:02:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T355609)', diff saved to https://phabricator.wikimedia.org/P56053 and previous config saved to /var/cache/conftool/dbconfig/20240201-110252-marostegui.json [11:02:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:03:00] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:03:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:03:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137:3314 (T355609)', diff saved to https://phabricator.wikimedia.org/P56054 and previous config saved to /var/cache/conftool/dbconfig/20240201-110315-marostegui.json [11:04:01] (03PS2) 10Arnaudb: mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) [11:05:57] (03CR) 10MVernon: [C: 03+1] sessionstore: remove EOL hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994830 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [11:06:21] (03CR) 10Arnaudb: "so I'll let you use your own process Jaime as it seems to be more efficient. 
I've re-enabled notifications for impacted hosts on this patc" [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:06:53] (03CR) 10Marostegui: [C: 03+1] mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:07:22] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/994341 (https://phabricator.wikimedia.org/T355979) (owner: 10EoghanGaffney) [11:07:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::hadoop::yarn [11:09:05] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10akosiaris) >>! In T316421#9501760, @Jelto wrote: > To build the new Debian package for etherpad 1.9.6 I need access to the `packaging` wmcs project. According to [op... [11:09:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet [11:10:08] (03PS3) 10Slyngshede: P:debmonitor::server enable django/uwsgi logging. [puppet] - 10https://gerrit.wikimedia.org/r/994998 [11:11:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1259/co" [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [11:13:16] (03CR) 10Jcrespo: "Please let me know when the backup source hosts are ready for data population and I will do that and the backup config change." [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:13:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet [11:14:07] (03PS4) 10Slyngshede: P:debmonitor::server enable django/uwsgi logging. 
[puppet] - 10https://gerrit.wikimedia.org/r/994998 [11:15:20] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10fnegri) [11:15:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1260/co" [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [11:17:06] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] oauth2-proxy: add ca-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994986 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [11:18:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:19:33] (03CR) 10Majavah: [C: 03+2] P:openstack: neutron: use cloud-private for memcached access [puppet] - 10https://gerrit.wikimedia.org/r/991773 (https://phabricator.wikimedia.org/T355417) (owner: 10Majavah) [11:20:31] (03PS1) 10Filippo Giunchedi: jaeger: tag oauth2-proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/995003 (https://phabricator.wikimedia.org/T320555) [11:20:41] (03PS3) 10Arnaudb: mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) [11:20:51] (03CR) 10Arnaudb: "forgot to trim hosts from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:21:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:21:05] (03PS5) 10Slyngshede: P:debmonitor::server enable django/uwsgi logging. 
[puppet] - 10https://gerrit.wikimedia.org/r/994998 [11:21:37] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1261/co" [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [11:22:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:23:31] (03CR) 10Filippo Giunchedi: "Thank you for the rationale/explanations" [puppet] - 10https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) (owner: 10Cwhite) [11:27:04] (03CR) 10Filippo Giunchedi: "LGTM, to do post-reimage of course" [puppet] - 10https://gerrit.wikimedia.org/r/994999 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [11:30:58] (03CR) 10Clément Goubert: "Should we add the environment variable to the Dockerfile.template's ENV directive? This way the possible environment variables are referen" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:39:15] (03CR) 10Effie Mouzeli: "I see your point, albeit that the status quo for this setting, has always been mc.php. 
In my opinion, that is the first place eg a dev wil" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:49:49] (03CR) 10EoghanGaffney: [C: 03+2] [vrts] Switch from RuntimeDB to StaticDB for queue indexes [puppet] - 10https://gerrit.wikimedia.org/r/994341 (https://phabricator.wikimedia.org/T355979) (owner: 10EoghanGaffney) [11:51:24] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/995006 (https://phabricator.wikimedia.org/T355591) (owner: 10Arnaudb) [11:57:01] (03Abandoned) 10Muehlenhoff: Failover debmonitor to debmonitor1003 [dns] - 10https://gerrit.wikimedia.org/r/994881 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [11:57:14] (03CR) 10Muehlenhoff: [C: 03+2] debmonitor: Move Hiera entries formerly specific to the new nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/994990 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [11:58:20] (03CR) 10Clément Goubert: "That's fair. 
In that case we may want to amend the mediawiki chart's value.yaml L193 and following that refers to the Dockerfile as the so" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:00:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [12:03:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T355609)', diff saved to https://phabricator.wikimedia.org/P56056 and previous config saved to /var/cache/conftool/dbconfig/20240201-120346-marostegui.json [12:03:56] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:04:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [12:04:32] (03PS2) 10Clément Goubert: ipoid: Fix probe definition [puppet] - 10https://gerrit.wikimedia.org/r/995005 [12:05:14] (03CR) 10Volans: [C: 03+2] requestctl-generator: adapt for superset 3 API [puppet] - 10https://gerrit.wikimedia.org/r/994811 (https://phabricator.wikimedia.org/T335356) (owner: 10Volans) [12:05:49] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [12:06:36] (03CR) 10Kosta Harlan: "We could add a healthz endpoint if you prefer. Our current usage of swagger is pretty minimal." [puppet] - 10https://gerrit.wikimedia.org/r/995005 (owner: 10Clément Goubert) [12:06:47] (03CR) 10Ladsgroup: Add wgVirtualDomainsMapping for Cognate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994922 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [12:09:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet [12:12:37] (03CR) 10Clément Goubert: [V: 03+1] "The _info endpoint will work as well for a probe, since it's also used as the liveness probe in kubernetes." 
[puppet] - 10https://gerrit.wikimedia.org/r/995005 (owner: 10Clément Goubert) [12:12:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) a:03BTullis I'm going to have a crack at these reimages, if that's OK. Please let me know if I tread on anyone's t... [12:13:50] (03CR) 10Kosta Harlan: [C: 03+1] ipoid: Fix probe definition [puppet] - 10https://gerrit.wikimedia.org/r/995005 (owner: 10Clément Goubert) [12:14:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:16] (03PS3) 10Clément Goubert: ipoid: Fix probe definition [puppet] - 10https://gerrit.wikimedia.org/r/995005 [12:14:41] (03CR) 10Kosta Harlan: ipoid: Fix probe definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995005 (owner: 10Clément Goubert) [12:15:19] (03PS4) 10Clément Goubert: ipoid: Fix probe definition [puppet] - 10https://gerrit.wikimedia.org/r/995005 (https://phabricator.wikimedia.org/T325147) [12:15:33] (03CR) 10Clément Goubert: ipoid: Fix probe definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995005 (https://phabricator.wikimedia.org/T325147) (owner: 10Clément Goubert) [12:15:37] (03CR) 10Marostegui: [C: 03+1] mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [12:15:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet [12:17:39] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [12:18:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet [12:18:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2137:3314', diff saved to https://phabricator.wikimedia.org/P56057 and previous config saved to /var/cache/conftool/dbconfig/20240201-121853-marostegui.json [12:21:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet [12:24:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet [12:27:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [12:29:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:29:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:30:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [12:30:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [12:31:09] (03CR) 10Muehlenhoff: [C: 03+1] P:debmonitor::server enable django/uwsgi logging. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [12:32:02] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [12:33:27] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10Ladsgroup) The pregen values are: ` amir@amir-ThinkPad-P1-Gen-3:~/mediawiki-config/wmf-config$ grep -ri -A10 UploadThumbnailRend... 
[12:34:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P56058 and previous config saved to /var/cache/conftool/dbconfig/20240201-123400-marostegui.json [12:36:35] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [12:36:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet wi... [12:37:04] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [12:46:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2001-dev.codfw.wmnet [12:49:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T355609)', diff saved to https://phabricator.wikimedia.org/P56059 and previous config saved to /var/cache/conftool/dbconfig/20240201-124906-marostegui.json [12:49:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:49:11] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:49:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] ipoid: Fix probe definition [puppet] - 10https://gerrit.wikimedia.org/r/995005 (https://phabricator.wikimedia.org/T325147) (owner: 10Clément Goubert) [12:49:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:49:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138:3314 (T355609)', diff saved to https://phabricator.wikimedia.org/P56060 and previous config saved to 
/var/cache/conftool/dbconfig/20240201-124928-marostegui.json [12:49:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] jaeger: tag oauth2-proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/995003 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [12:50:20] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:29] (03Merged) 10jenkins-bot: jaeger: tag oauth2-proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/995003 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [12:51:02] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:51:11] 10SRE, 10SRE-Access-Requests, 10User-aborrero: ops: add access for aborrero - https://phabricator.wikimedia.org/T356403 (10aborrero) [12:52:36] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:37] (03PS6) 10Slyngshede: P:debmonitor::server enable django/uwsgi logging. [puppet] - 10https://gerrit.wikimedia.org/r/994998 [12:53:21] (03PS1) 10Ayounsi: Routed Ganeti: Add v6 static route to VM [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) [12:53:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Good point. Since it won't hurt to try this out I 'll deploy and let's revisit the JVM side if needed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/994197 (https://phabricator.wikimedia.org/T266216) (owner: 10Alexandros Kosiaris) [12:53:26] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:20] (03Merged) 10jenkins-bot: rdf-streaming-updated: Bump taskmanager memory limit by ~33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/994197 (https://phabricator.wikimedia.org/T266216) (owner: 10Alexandros Kosiaris) [12:54:26] !log btullis@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet'] [12:55:11] !log btullis@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088.codfw.wmnet'] [12:56:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [12:56:35] (03CR) 10Clément Goubert: [C: 03+2] ipoid: Fix probe definition [puppet] - 10https://gerrit.wikimedia.org/r/995005 (https://phabricator.wikimedia.org/T325147) (owner: 10Clément Goubert) [12:57:38] !log btullis@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet'] [12:58:15] !log btullis@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2088.codfw.wmnet'] [12:58:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudlb2001-dev.codfw.wmnet [12:59:27] (03CR) 10Slyngshede: [C: 03+2] P:debmonitor::server enable django/uwsgi logging. 
[puppet] - 10https://gerrit.wikimedia.org/r/994998 (owner: 10Slyngshede) [12:59:28] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [12:59:30] PROBLEM - Check systemd state on cloudlb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:34] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1300) [13:03:13] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [13:03:17] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [13:05:32] (03PS1) 10Slyngshede: P:debmonitor::server templates are content, not source. [puppet] - 10https://gerrit.wikimedia.org/r/995033 [13:05:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992899 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [13:05:54] (03PS2) 10Slyngshede: P:debmonitor::server templates are content, not source. 
[puppet] - 10https://gerrit.wikimedia.org/r/995033 [13:06:08] (03PS2) 10Alexandros Kosiaris: ipoid: Fix chart default ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/992899 (https://phabricator.wikimedia.org/T355167) [13:06:50] 10SRE, 10SRE-Access-Requests, 10User-aborrero: ops: add access for aborrero - https://phabricator.wikimedia.org/T356403 (10ABran-WMF) p:05Triage→03Medium a:03MoritzMuehlenhoff [13:07:44] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1264/co" [puppet] - 10https://gerrit.wikimedia.org/r/995033 (owner: 10Slyngshede) [13:07:57] (03CR) 10Arnaudb: [C: 03+2] admin: add arinaigum to users [puppet] - 10https://gerrit.wikimedia.org/r/995006 (https://phabricator.wikimedia.org/T355591) (owner: 10Arnaudb) [13:08:01] !log btullis@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [13:10:05] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10ABran-WMF) 05In progress→03Resolved a:03ABran-WMF everything should be settled now [13:10:46] (03CR) 10Slyngshede: [V: 03+1] "Not sure why PCC didn't spot this" [puppet] - 10https://gerrit.wikimedia.org/r/995033 (owner: 10Slyngshede) [13:12:15] (03CR) 10Muehlenhoff: [C: 03+1] P:debmonitor::server templates are content, not source. [puppet] - 10https://gerrit.wikimedia.org/r/995033 (owner: 10Slyngshede) [13:13:23] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:debmonitor::server templates are content, not source. [puppet] - 10https://gerrit.wikimedia.org/r/995033 (owner: 10Slyngshede) [13:14:17] (03CR) 10Arnaudb: [C: 03+2] mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:14:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/993190 (owner: 10JHathaway) [13:14:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T355609)', diff saved to https://phabricator.wikimedia.org/P56061 and previous config saved to /var/cache/conftool/dbconfig/20240201-131432-marostegui.json [13:14:47] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:15:43] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [13:15:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [13:16:15] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [13:20:38] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [13:21:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) I tried a reimage of elastic2008 and it completely hung at the PXE prompt, for at least 20 minutes before I switched... [13:22:26] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [13:23:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet [13:24:44] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[13:24:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] mw-debug: Enable tracing with 100% sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/994193 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [13:25:40] (03Merged) 10jenkins-bot: mw-debug: Enable tracing with 100% sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/994193 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [13:25:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [13:26:52] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:27:07] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:27:16] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:29:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T350458 [13:29:13] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:29:20] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [13:29:21] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [13:29:26] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:29:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T350458 [13:29:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T350458 [13:29:39] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P56062 and previous config saved to /var/cache/conftool/dbconfig/20240201-132938-marostegui.json [13:29:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: provisionning db1234.eqiad.wmnet - T350458 [13:29:48] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:30:54] (03PS1) 10Hashar: Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) [13:31:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [13:31:28] !log btullis@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [13:33:48] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[13:35:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudlb2002-dev.codfw.wmnet [13:36:17] PROBLEM - Check systemd state on cloudlb2002-dev is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:10] (03CR) 10CI reject: [V: 04-1] Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [13:37:33] (03PS6) 10Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) [13:38:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: provisionning db1244.eqiad.wmnet - T350458 [13:38:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: provisionning db1244.eqiad.wmnet - T350458 [13:38:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: provisionning db1244.eqiad.wmnet - T350458 [13:38:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: provisionning db1244.eqiad.wmnet - T350458 [13:39:00] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [13:39:24] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10brouberol) I've taken a couple of hours to whip up this [[ https://gitlab.wikimedia.org/repos/sre/kafka-configurator | PoC ]], v... 
[13:39:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) I gave elastic2094 a cold boot, then started the reimage cookbook. It is reporting the following error on the consol... [13:40:07] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [13:40:48] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [13:41:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db1144 in db1244 for T350458', diff saved to https://phabricator.wikimedia.org/P56064 and previous config saved to /var/cache/conftool/dbconfig/20240201-134107-arnaudb.json [13:42:18] !log btullis@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [13:44:08] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db1144.eqiad.wmnet onto db1244.eqiad.wmnet [13:44:19] !log btullis@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [13:44:30] !log btullis@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [13:44:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P56065 and previous config saved to /var/cache/conftool/dbconfig/20240201-134445-marostegui.json [13:47:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[13:48:52] (03CR) 10Volans: wdqs.data_transfer: refactor spicerack class api (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [13:56:05] (03PS2) 10Hashar: Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) [13:59:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T355609)', diff saved to https://phabricator.wikimedia.org/P56066 and previous config saved to /var/cache/conftool/dbconfig/20240201-135951-marostegui.json [13:59:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:00:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1400). [14:00:04] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:00:16] i can deploy today [14:00:21] Daimona: you ready? 
:D [14:00:38] yup [14:00:43] let's get started :) [14:01:00] (03PS3) 10Urbanecm: beta: Configure Fluxx endpoint for WikimediaCampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994866 (https://phabricator.wikimedia.org/T347894) (owner: 10Daimona Eaytoy) [14:01:03] (03CR) 10Urbanecm: [C: 03+2] beta: Configure Fluxx endpoint for WikimediaCampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994866 (https://phabricator.wikimedia.org/T347894) (owner: 10Daimona Eaytoy) [14:01:12] (03PS3) 10Urbanecm: private/readme.php: Add stubs for WikimediaCampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994867 (https://phabricator.wikimedia.org/T347894) (owner: 10Daimona Eaytoy) [14:01:15] (03CR) 10Urbanecm: [C: 03+2] private/readme.php: Add stubs for WikimediaCampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994867 (https://phabricator.wikimedia.org/T347894) (owner: 10Daimona Eaytoy) [14:01:19] (03PS3) 10Urbanecm: Update commonsettings-labs to enable WikimediaCampaignEvents extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994179 (https://phabricator.wikimedia.org/T347894) (owner: 10Mhorsey) [14:01:48] (03Merged) 10jenkins-bot: beta: Configure Fluxx endpoint for WikimediaCampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994866 (https://phabricator.wikimedia.org/T347894) (owner: 10Daimona Eaytoy) [14:02:03] (03Merged) 10jenkins-bot: private/readme.php: Add stubs for WikimediaCampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994867 (https://phabricator.wikimedia.org/T347894) (owner: 10Daimona Eaytoy) [14:02:37] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [14:05:04] (03PS1) 10Muehlenhoff: debmonitor: Remove support for old deployment method [puppet] - 10https://gerrit.wikimedia.org/r/995040 (https://phabricator.wikimedia.org/T241049) [14:06:17] PROBLEM - Check systemd state on
install2004 is CRITICAL: CRITICAL - degraded: The following units failed: isc-dhcp-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:21] RECOVERY - Check systemd state on install2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:04] (03CR) 10Urbanecm: [C: 03+2] Update commonsettings-labs to enable WikimediaCampaignEvents extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994179 (https://phabricator.wikimedia.org/T347894) (owner: 10Mhorsey) [14:09:48] (03Merged) 10jenkins-bot: Update commonsettings-labs to enable WikimediaCampaignEvents extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994179 (https://phabricator.wikimedia.org/T347894) (owner: 10Mhorsey) [14:12:14] 10SRE, 10SRE-Access-Requests, 10User-aborrero: ops: add access for aborrero - https://phabricator.wikimedia.org/T356403 (10joanna_borun) Approved [14:12:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: provisionning db1246.eqiad.wmnet - T350458 [14:13:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: provisionning db1246.eqiad.wmnet - T350458 [14:13:06] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [14:13:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: provisionning db1246.eqiad.wmnet - T350458 [14:13:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: provisionning db1246.eqiad.wmnet - T350458 [14:15:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db1146 in db1246 for T350458', diff saved to https://phabricator.wikimedia.org/P56067 and previous config saved to 
/var/cache/conftool/dbconfig/20240201-141531-arnaudb.json [14:16:06] !log btullis@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:16:34] urbanecm: i have a doubt about T352424 since there's no community consensus on that discussion is it ok to add draft and portal namespace on my.wikipedia.org [14:16:35] T352424: Create Portal and Draft namespaces in mywiki - https://phabricator.wikimedia.org/T352424 [14:17:57] (03PS1) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [14:18:35] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db1146.eqiad.wmnet onto db1246.eqiad.wmnet [14:19:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance [14:20:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance [14:20:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T355609)', diff saved to https://phabricator.wikimedia.org/P56068 and previous config saved to /var/cache/conftool/dbconfig/20240201-142009-marostegui.json [14:21:16] (03PS1) 10Slyngshede: P:debmonitor::server increase uwsgi buffer size. [puppet] - 10https://gerrit.wikimedia.org/r/995043 [14:21:30] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:22:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995043 (owner: 10Slyngshede) [14:22:49] (03PS1) 10Vivian Rook: Move prometheus inside paws cluster [puppet] - 10https://gerrit.wikimedia.org/r/995044 (https://phabricator.wikimedia.org/T355179) [14:23:11] (03CR) 10Majavah: "just to confirm: does Elastic use any certificates for encrypting the inter-cluster traffic?" 
[puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:23:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:25:27] (03CR) 10Slyngshede: [C: 03+2] P:debmonitor::server increase uwsgi buffer size. [puppet] - 10https://gerrit.wikimedia.org/r/995043 (owner: 10Slyngshede) [14:26:17] (03CR) 10Bking: "No, it's all cleartext ATM as that's a pay feature in Elastic. We will explore TLS for inter-cluster communication as part of the Opensear" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:29:20] (03PS2) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [14:29:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:31:10] !log btullis@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:32:42] (03PS3) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [14:33:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:33:48] 10SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412 (10MatthewVernon) [14:34:24] 10SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412 (10MatthewVernon) [it's not immediately obvious to me what the extra work of `cfssl` gets us over `sslcert`] [14:35:28] (03CR) 10Alexandros 
Kosiaris: [C: 03+1] admin_ng: elevate ml users experimental permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [14:37:00] (03CR) 10David Caro: [C: 03+1] "LGTM, the prometheus part only applies to the prometheus VMs though" [puppet] - 10https://gerrit.wikimedia.org/r/995044 (https://phabricator.wikimedia.org/T355179) (owner: 10Vivian Rook) [14:38:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995040 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [14:39:28] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:02] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:42:32] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:44:23] (03PS2) 10Vivian Rook: Move prometheus inside paws cluster [puppet] - 10https://gerrit.wikimedia.org/r/995044 (https://phabricator.wikimedia.org/T355179) [14:46:00] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2088.codfw.wmnet with OS bullseye [14:46:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with O... [14:46:31] 10SRE, 10SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (10SLopes-WMF) Go ahead, please. 
[14:49:05] (03PS1) 10Clément Goubert: kubernetes: make 3 appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/995045 (https://phabricator.wikimedia.org/T351074) [14:50:39] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [14:50:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet wi... [14:51:43] !log btullis@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye [14:51:53] !log btullis@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic2094.codfw.wmnet'] [14:54:02] (03PS1) 10MVernon: swift: remove ms-be20[44-50] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/995046 (https://phabricator.wikimedia.org/T353149) [14:54:34] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10Ganesha811) Hmm, so the nearest is 320px? That's not particularly close, given the current is 220 and the consensus was for a ch... [14:59:28] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (10BTullis) It looks like elastic2094 may have some kind of hardware problem. 
{F41739906,width=60%} I have tried both cold booti... [14:59:44] (03CR) 10Marostegui: [C: 03+1] swift: remove ms-be20[44-50] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/995046 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [15:01:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) [15:01:55] Amir1: I must be missing something in the conditional defaults stuff... i just did an end-to-end run at testwiki. Started with 4948 rows, ended up with 32100. Not exactly what we wanted :D [15:02:03] (03CR) 10Hashar: [C: 03+1] "> I have cherry picked the patch on the integration Puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) (owner: 10Hashar) [15:02:18] oh fun [15:02:47] yeah... [15:03:20] it can be a lot of things: Writing multiple times on the same row? up_value being messed up in some cases (it's quite common...)? [15:04:28] yeah...something that needs investigating. [15:04:40] do you have the write logs? [15:05:04] it can be some bad counting by affectedRows() [15:05:07] in a tmux session [15:05:20] (03CR) 10Hnowlan: [C: 03+1] kubernetes: make 3 appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/995045 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:05:36] i did not do affectedRows. i ran `select count(*) from user_properties where up_property='echo-subscriptions-web-reverted'; ` when i started and when i finished. 
[15:05:37] (03CR) 10Hnowlan: [C: 03+2] tegola-vector-tiles: add maps primaries to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/993700 (https://phabricator.wikimedia.org/T355892) (owner: 10Hnowlan) [15:05:41] that's supposed to be accurate [15:06:45] (03Merged) 10jenkins-bot: tegola-vector-tiles: add maps primaries to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/993700 (https://phabricator.wikimedia.org/T355892) (owner: 10Hnowlan) [15:08:00] (03CR) 10MVernon: [C: 03+2] swift: remove ms-be20[44-50] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/995046 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [15:12:40] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [15:13:03] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [15:14:34] (03CR) 10Vivian Rook: [C: 03+2] Move prometheus inside paws cluster [puppet] - 10https://gerrit.wikimedia.org/r/995044 (https://phabricator.wikimedia.org/T355179) (owner: 10Vivian Rook) [15:14:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/995015 [15:14:55] (03CR) 10DCausse: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:15:57] RECOVERY - Check systemd state on cloudlb2002-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:05] RECOVERY - Check systemd state on cloudlb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:16] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [15:20:31] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply 
[15:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T355609)', diff saved to https://phabricator.wikimedia.org/P56069 and previous config saved to /var/cache/conftool/dbconfig/20240201-152040-marostegui.json [15:21:03] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:25:17] (03CR) 10JHathaway: [C: 03+2] interface: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993190 (owner: 10JHathaway) [15:31:41] (03PS2) 10Hnowlan: admin_ng: bump overall limit for thumbor memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/994751 [15:31:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:35:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P56070 and previous config saved to /var/cache/conftool/dbconfig/20240201-153547-marostegui.json [15:36:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:37:24] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/993068 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [15:37:43] Amir1: i figured out what is happening. autocreated users were ignored by the hook, but i am _not_ ignoring them with what i did. 
[15:38:39] ah [15:38:44] so things are good [15:38:47] I guess [15:38:59] except i changed the preference values for users who were autocreated. not that big of a deal i guess. [15:39:01] (03CR) 10Clément Goubert: [C: 03+1] admin_ng: bump overall limit for thumbor memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/994751 (owner: 10Hnowlan) [15:40:00] (03PS4) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [15:40:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:40:31] (03CR) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:42:45] (03CR) 10Clément Goubert: [C: 03+2] kubernetes: make 3 appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/995045 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:43:49] (03CR) 10Hnowlan: [C: 03+2] admin_ng: bump overall limit for thumbor memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/994751 (owner: 10Hnowlan) [15:44:14] (03CR) 10Ilias Sarantopoulos: [C: 03+2] admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [15:44:41] (03CR) 10Ilias Sarantopoulos: [C: 03+2] admin_ng: elevate ml users experimental permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [15:45:26] Amir1: good question is...what do i do to fix it? introducing `CUDCOND_AUTOCREATED`, to have special default for autocreated users? 
is that information i can fetch long after onLocalUserCreated has fired (it is in `centralauth.localuser`, but surely I don't want to make core depend on CentralAuth)? do you know / have any suggestions here? [15:46:12] (03PS5) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [15:46:15] or, alternatively, how to convince userOptions to ignore autocreated users? [15:46:33] I honestly don't think there should be a difference between auto-created and non-autocreated users in terms of preferences [15:46:41] it made sense a decade ago, not anymore [15:46:56] (03Merged) 10jenkins-bot: admin_ng: bump overall limit for thumbor memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/994751 (owner: 10Hnowlan) [15:47:18] (03Merged) 10jenkins-bot: admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [15:47:52] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:48:30] Amir1: in that case, i'd need to exclude them in the userOptions.php step. which needs an answer to more or less the same question: how do I find whether the user is autocreated or not at that stage? [15:48:31] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2448.codfw.wmnet with OS bullseye [15:48:34] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2449.codfw.wmnet with OS bullseye [15:48:39] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2447.codfw.wmnet with OS bullseye [15:49:02] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:49:16] (03PS1) 10Marostegui: db1106: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/995050 (https://phabricator.wikimedia.org/T327616) [15:49:27] (03CR) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:49:30] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:49:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [15:49:57] urbanecm: my saying is that it shouldn't care whether the user was auto created or not [15:50:03] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:50:11] even if it was created a decade ago [15:50:29] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:50:42] let's avoid adding complexity (or remove it in this case) when it doesn't make sense [15:50:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P56071 and previous config saved to /var/cache/conftool/dbconfig/20240201-155054-marostegui.json [15:51:05] (03CR) 10Marostegui: [C: 03+2] db1106: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/995050 (https://phabricator.wikimedia.org/T327616) (owner: 10Marostegui) [15:51:41] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1106 from dbctl T327616', diff saved to https://phabricator.wikimedia.org/P56072 and previous config saved to /var/cache/conftool/dbconfig/20240201-155203-marostegui.json [15:52:27] T327616: decommission db1106.eqiad.wmnet - https://phabricator.wikimedia.org/T327616 [15:55:06] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/995015 (owner: 10PipelineBot) [15:55:21] Amir1: i don't follow your suggestion. let me explain the problem i'm looking at: i'm changing the default value for a significant portion of our users. to ensure no one's preferences are changed, i'm adding some rows to user_properties and removing some others (with the hope that this will have a net-negative result). autocreated users (before the change) use the default value. because i _changed_ that default value, [15:55:21] treating them the same way as everyone else means inserting user_properties rows for those users (so that they have the same preference). from there, i have several choices: (1) i can ignore the fact that the preference state for autocreated users will switch (2) i can add a separate default just for them (3) i can make everyone's preferences state switch to the other value [15:55:25] (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:33] (1) and (2) requires knowing (at this point) which accounts are autocreated [15:55:39] (3) is what you told me in our last 1:1 to avoid [15:55:53] i don't see a fourth solution rn [15:56:12] can you elaborate on what "let's not care whether the user was auto created or not" means? [15:56:31] because if i just treat those accounts the same way, i won't reduce the number of rows we use. [15:57:01] (i can also just enable conditional defaults starting from today, to stop the influx of new rows, and avoid dropping any rows. also an option.) [15:57:50] what i am missing here? [15:58:17] let's go with examples, user A is autocreated in enwiki, their default would be the general default. Correct? 
[15:58:22] yes [15:58:45] (03PS1) 10Jgiannelos: mobileapps: Bump staging version for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/995051 [15:59:47] (fwiw, i'm concerned of legacy autocreated users. not new autocreated users.) [16:00:15] I think I get the problem. During the switch, we don't want to add the rows for them and should let them be flushed and pick the conditional default [16:00:25] (SystemdUnitFailed) resolved: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:11] exactly [16:01:28] and to skip them during the insert rows part, i need to be able to identify them [16:01:46] so the part that's adding the rows for users who have been auto created, we should ignore them. I'd say do an ad-hoc query to CA or something like that [16:02:04] and make userOptions.php accept a txt file with a list of user IDs? [16:02:14] anything you like [16:02:20] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump staging version for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/995051 (owner: 10Jgiannelos) [16:02:48] ideally, i'd like a single script that does that migration all by itself. so that i can write a writeup, that others can use to use conditional defaults elsewhere without running into the same problems as i did. [16:03:10] (03Merged) 10jenkins-bot: mobileapps: Bump staging version for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/995051 (owner: 10Jgiannelos) [16:03:21] `mwscript userOptions.php --wiki=xx --ignore-autocreated`, something like that [16:03:37] or maybe a one-off WikimediaMaintenance script? 
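[editor's sketch] The idea of skipping legacy autocreated users during the insert-rows step, using a text file of user IDs, might look roughly like this. Everything here is hypothetical: the file format (one ID per line), the helper names `load_user_ids` and `rows_to_insert`, and the sample IDs. How the list would be produced (for example an ad-hoc CentralAuth query) is outside the sketch, and this is not how userOptions.php works today.

```python
import io


def load_user_ids(handle):
    """Parse one user ID per line, ignoring blank lines and '#' comments."""
    ids = set()
    for line in handle:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(int(line))
    return ids


def rows_to_insert(users_on_old_default, prop, old_default, skip_ids):
    """Build explicit rows that pin the old default, skipping listed users."""
    return [
        (uid, prop, old_default)
        for uid in sorted(users_on_old_default)
        if uid not in skip_ids
    ]


# Hypothetical export of autocreated user IDs; a real run would open a file.
autocreated_file = io.StringIO("# autocreated user IDs\n2\n4\n")
skip = load_user_ids(autocreated_file)

# Users 1-5 currently rely on the old default; 2 and 4 are autocreated.
rows = rows_to_insert(
    {1, 2, 3, 4, 5}, "echo-subscriptions-web-reverted", "1", skip
)
for row in rows:
    print(row)
```

Users in the skip list get no explicit row, so they would fall through to whatever the conditional default resolves to after the switch.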
[16:04:15] I'd go with the latter, CA is not used outside of WMF [16:04:32] like it's used but very rarely [16:04:33] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:04:41] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2448.codfw.wmnet with reason: host reimage [16:05:44] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2447.codfw.wmnet with reason: host reimage [16:05:55] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2449.codfw.wmnet with reason: host reimage [16:06:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T355609)', diff saved to https://phabricator.wikimedia.org/P56073 and previous config saved to /var/cache/conftool/dbconfig/20240201-160600-marostegui.json [16:06:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance [16:06:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance [16:06:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:06:27] (03PS1) 10Jgiannelos: Fix staging image configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/995052 [16:06:30] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:06:35] (03CR) 10CI reject: [V: 04-1] Fix staging image configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/995052 (owner: 10Jgiannelos) [16:06:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:06:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T355609)', diff saved to https://phabricator.wikimedia.org/P56074 and previous config saved to 
/var/cache/conftool/dbconfig/20240201-160650-marostegui.json [16:07:04] (03PS2) 10Jgiannelos: Fix staging image configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/995052 [16:07:07] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:07:09] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:07:16] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:07:38] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:07:41] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2448.codfw.wmnet with reason: host reimage [16:07:51] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:08:08] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:08:11] (03PS2) 10Ayounsi: Routed Ganeti: Add v6 static route to VM [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) [16:08:22] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:08:48] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:10:02] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2447.codfw.wmnet with reason: host reimage [16:10:03] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) Before closing this task I'd like to get a confirmation from Goran whether the level of access is... 
[16:10:53] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye [16:10:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS... [16:12:59] (03CR) 10Sbailey: [C: 03+1] Fix staging image configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/995052 (owner: 10Jgiannelos) [16:13:05] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2449.codfw.wmnet with reason: host reimage [16:13:39] (03CR) 10Jgiannelos: [C: 03+2] Fix staging image configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/995052 (owner: 10Jgiannelos) [16:14:34] (03Merged) 10jenkins-bot: Fix staging image configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/995052 (owner: 10Jgiannelos) [16:15:48] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:16:20] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:19:01] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [16:19:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2088.codfw.wmnet wit... [16:20:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. 
- https://phabricator.wikimedia.org/T355830 (10BTullis) I have restarted the reimage cookbook for elastic2088, I realise that I should have selected puppet 7 instead of pupp... [16:26:29] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2448.codfw.wmnet with OS bullseye [16:27:11] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Etherpad need restore to previous revision - https://phabricator.wikimedia.org/T356376 (10Dzahn) a:03Dzahn [16:29:59] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2447.codfw.wmnet with OS bullseye [16:30:14] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1265/co" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [16:33:03] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2449.codfw.wmnet with OS bullseye [16:35:23] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [16:36:10] (03CR) 10DCausse: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:38:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [16:38:29] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Etherpad need restore to previous revision - https://phabricator.wikimedia.org/T356376 (10Dzahn) @SCP-2000 done! I restored to revision 4997. There are already new edits but there is content now. 
[16:38:41] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [16:38:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host elastic2094.codfw.wmnet wit... [16:39:10] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Etherpad need restore to previous revision - https://phabricator.wikimedia.org/T356376 (10Dzahn) 05Open→03Resolved [16:40:50] (03PS1) 10Volans: reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) [16:42:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) I updated the system BIOS on elastic2094 from version 1.11.2 to version 1.12.1 but it didn't make any difference to t... 
[16:42:44] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2094.codfw.wmnet with OS bullseye [16:44:14] (03CR) 10CI reject: [V: 04-1] reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [16:46:26] (03PS1) 10Arturo Borrero Gonzalez: Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 [16:47:30] (03CR) 10David Caro: [C: 03+1] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:47:42] (03CR) 10David Caro: [C: 03+2] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:47:46] (03CR) 10Andrew Bogott: [C: 03+2] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:47:48] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:47:51] (03CR) 10FNegri: [C: 03+1] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:49:26] (03PS2) 10Arturo Borrero Gonzalez: Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 [16:49:55] (03PS3) 10David Caro: Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:50:26] (03CR) 10David Caro: [C: 03+2] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:50:59] (03CR) 10David Caro: [V: 03+2 C: 03+2] Revert "aborrero: drop access" [labs/private] - 10https://gerrit.wikimedia.org/r/994971 (owner: 10Arturo Borrero Gonzalez) [16:51:12] 10SRE, 10ops-codfw, 10DC-Ops, 
10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10BTullis) a:05BTullis→03None [16:51:18] (03PS2) 10Volans: reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) [16:53:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db1144.eqiad.wmnet onto db1244.eqiad.wmnet [16:54:17] (03CR) 10CI reject: [V: 04-1] reqconfig: add command to search IP in ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [16:54:35] (03PS2) 10Ayounsi: Routed Ganeti: enable IPv6 forwarding [puppet] - 10https://gerrit.wikimedia.org/r/994997 (https://phabricator.wikimedia.org/T300152) [16:55:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2088.codfw.wmnet with OS bullseye [16:55:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Hardware error on elastic2094 - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T355830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host elastic2088.codfw.wmnet with OS... [16:55:25] (03CR) 10Volans: "This is a draft and I can't run the tests locally. So will look at them only if there is an agreement on the feature." [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [17:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:01:06] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10AndrewTavis_WMDE) Thank you for the continued attention here, @mpopov. Final investigations of this infra... [17:04:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: prometheus-node-exporter errors on firewall-running.prom content - https://phabricator.wikimedia.org/T356305 (10Volans) 05Open→03Resolved a:03Volans All the hosts have stopped logging the error. Resolving. [17:05:10] (03PS1) 10Jdlrobson: [WIP] Deploy taglines and wordmark to wiktionary and wikisource projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995058 (https://phabricator.wikimedia.org/T349036) [17:05:53] (03CR) 10CI reject: [V: 04-1] [WIP] Deploy taglines and wordmark to wiktionary and wikisource projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995058 (https://phabricator.wikimedia.org/T349036) (owner: 10Jdlrobson) [17:07:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T355609)', diff saved to https://phabricator.wikimedia.org/P56075 and previous config saved to /var/cache/conftool/dbconfig/20240201-170722-marostegui.json [17:07:26] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:10:23] (03PS1) 10Tchanders: Set $wgEnablePartialActionBlocks true for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995059 (https://phabricator.wikimedia.org/T353495) [17:10:31] (03PS1) 10Scott French: httpbb: manage test timer presence unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/995060 [17:21:03] (03PS2) 10Scott French: httpbb: manage test timer presence unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/995060 [17:21:42] (03PS6) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 
(https://phabricator.wikimedia.org/T355617) [17:22:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:22:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P56076 and previous config saved to /var/cache/conftool/dbconfig/20240201-172228-marostegui.json [17:26:46] (03PS3) 10BCornwall: ncredir: Set fifo_log_demux/nginx as wanted_by [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [17:33:45] (03PS1) 10Clément Goubert: calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/995063 [17:34:31] (03PS1) 10Eevans: cassandra: break-fix image suggestions Cassandra access [puppet] - 10https://gerrit.wikimedia.org/r/995064 (https://phabricator.wikimedia.org/T356400) [17:35:23] (03CR) 10Dreamy Jazz: [C: 03+1] Set $wgEnablePartialActionBlocks true for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995059 (https://phabricator.wikimedia.org/T353495) (owner: 10Tchanders) [17:37:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P56077 and previous config saved to /var/cache/conftool/dbconfig/20240201-173735-marostegui.json [17:40:04] (03CR) 10Eevans: [C: 03+2] cassandra: break-fix image suggestions Cassandra access [puppet] - 10https://gerrit.wikimedia.org/r/995064 (https://phabricator.wikimedia.org/T356400) (owner: 10Eevans) [17:45:35] (03PS7) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [17:45:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:46:47] (03CR) 10CI reject: [V: 
04-1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:47:40] (03PS1) 10Eevans: cassandra: fixup copypasta error (role name) [puppet] - 10https://gerrit.wikimedia.org/r/995066 [17:47:44] (03PS8) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [17:48:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:51:23] (03CR) 10Eevans: [C: 03+2] cassandra: fixup copypasta error (role name) [puppet] - 10https://gerrit.wikimedia.org/r/995066 (owner: 10Eevans) [17:52:14] (03PS9) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) [17:52:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:52:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T355609)', diff saved to https://phabricator.wikimedia.org/P56078 and previous config saved to /var/cache/conftool/dbconfig/20240201-175241-marostegui.json [17:52:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance [17:52:46] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:52:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance [17:53:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T355609)', diff saved to https://phabricator.wikimedia.org/P56079 and previous config saved to 
/var/cache/conftool/dbconfig/20240201-175303-marostegui.json [17:59:42] (03CR) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:00:04] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1800). [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1800) [18:00:57] I probably do have something to deploy, but I'm in a meeting for a while. I'll look later. :) [18:05:26] (03CR) 10Effie Mouzeli: "ack, will do at a later time since for the time being and while testing, the variable will be set under .Values.config.public (I43c4370e3c" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [18:07:20] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-01-29-122435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/995067 [18:08:42] 10SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412 (10jijiki) [18:11:33] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2024-01-29-122435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/995067 (owner: 10BryanDavis) [18:12:30] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-01-29-122435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/995067 (owner: 10BryanDavis) [18:15:28] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:16:27] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: 
apply [18:16:42] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:17:13] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:17:19] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:17:43] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:18:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T355609)', diff saved to https://phabricator.wikimedia.org/P56081 and previous config saved to /var/cache/conftool/dbconfig/20240201-181837-marostegui.json [18:18:42] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:19:13] !log aokoth@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM vrts2001.codfw.wmnet [18:19:30] !log aokoth@cumin2002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM vrts2001.codfw.wmnet [18:20:13] !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts2001.codfw.wmnet [18:24:02] !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM vrts2001.codfw.wmnet [18:29:32] (03PS1) 10Ssingh: wikimedia-dns.org: dummy commit (update comment) [dns] - 10https://gerrit.wikimedia.org/r/995073 [18:30:39] (03CR) 10Ssingh: [C: 03+2] wikimedia-dns.org: dummy commit (update comment) [dns] - 10https://gerrit.wikimedia.org/r/995073 (owner: 10Ssingh) [18:30:57] (03CR) 10Dzahn: [C: 03+1] "lgtm, you should add the verb form here though ;) https://en.wiktionary.org/wiki/anycast#Noun" [dns] - 10https://gerrit.wikimedia.org/r/995073 (owner: 10Ssingh) [18:32:31] !log running dummy authdns-update [18:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P56082 and previous config saved to 
/var/cache/conftool/dbconfig/20240201-183343-marostegui.json [18:34:09] (03CR) 10Dzahn: [C: 03+2] ci: fetch tags for git mirrors [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) (owner: 10Hashar) [18:38:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db1146.eqiad.wmnet onto db1246.eqiad.wmnet [18:40:07] mutante: ha thanks, we were talking about running authdns-update as part of onboarding and this was a dummy commit for that [18:41:02] 10SRE, 10serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) [18:41:20] (03CR) 10EoghanGaffney: [C: 03+1] vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 (owner: 10AOkoth) [18:41:56] (03CR) 10EoghanGaffney: [C: 03+1] "It would be nice to do something like having puppet selectively uncomment the correct lines in case znuny changes this file in future, but" [puppet] - 10https://gerrit.wikimedia.org/r/988679 (owner: 10AOkoth) [18:48:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P56083 and previous config saved to /var/cache/conftool/dbconfig/20240201-184850-marostegui.json [18:50:13] (03PS8) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [18:52:52] (03CR) 10AOkoth: [C: 03+2] vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 (owner: 10AOkoth) [19:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T1900) [19:00:39] o/ [19:00:56] (03PS6) 10Dbrant: Add testwiki config to test Contact page for account vanishing. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [19:01:45] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995086 (https://phabricator.wikimedia.org/T354434) [19:01:47] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995086 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:02:37] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995086 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:03:18] (03PS7) 10Dbrant: Add testwiki config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [19:03:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T355609)', diff saved to https://phabricator.wikimedia.org/P56084 and previous config saved to /var/cache/conftool/dbconfig/20240201-190357-marostegui.json [19:04:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance [19:04:02] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:04:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance [19:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T355609)', diff saved to https://phabricator.wikimedia.org/P56085 and previous config saved to /var/cache/conftool/dbconfig/20240201-190419-marostegui.json [19:12:36] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.16 refs T354434 [19:12:42] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [19:16:25] (03PS8) 10Dbrant: Add testwiki 
config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [19:17:56] (03CR) 10JHathaway: [C: 04-1] "I'm not sure if there is much value in adding DMARC records for subdomains, unless their policy differs from the org domain. In this case " [dns] - 10https://gerrit.wikimedia.org/r/994864 (https://phabricator.wikimedia.org/T355776) (owner: 10Dzahn) [19:20:14] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10KFrancis) The NDA has been signed. Thanks! [19:21:29] (03CR) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:27:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T355609)', diff saved to https://phabricator.wikimedia.org/P56086 and previous config saved to /var/cache/conftool/dbconfig/20240201-192745-marostegui.json [19:27:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:42:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P56087 and previous config saved to /var/cache/conftool/dbconfig/20240201-194251-marostegui.json [19:43:55] (03PS1) 10Ryan Kemper: wdqs: whitelist MiMoTextBase SPARQL endpoint [puppet] - 10https://gerrit.wikimedia.org/r/995090 (https://phabricator.wikimedia.org/T351488) [19:45:06] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: whitelist MiMoTextBase SPARQL endpoint [puppet] - 10https://gerrit.wikimedia.org/r/995090 (https://phabricator.wikimedia.org/T351488) (owner: 10Ryan Kemper) [19:52:04] (03CR) 10Gehel: [C: 03+1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) 
[19:57:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P56088 and previous config saved to /var/cache/conftool/dbconfig/20240201-195758-marostegui.json [20:07:30] (03PS3) 10Hashar: Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) [20:08:51] (03PS4) 10Hashar: Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) [20:09:15] (03CR) 10Hashar: "Paladox: do you know anything about the rename-plugin? 😎" [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [20:12:26] (03CR) 10Paladox: "I don't know too much about it other than it renames the repo :/" [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [20:13:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T355609)', diff saved to https://phabricator.wikimedia.org/P56089 and previous config saved to /var/cache/conftool/dbconfig/20240201-201304-marostegui.json [20:13:11] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:47:17] (03CR) 10CDanis: "I love the idea of this feature" [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240201T2100). [21:00:05] varnent, Tchanders, sharvani__, and dbrant: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] i can deploy today [21:00:21] Here for deployment o/ thank you [21:00:42] (03PS3) 10Sharvaniharan: New stream config for mobileapps Places feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988108 (https://phabricator.wikimedia.org/T351165) [21:00:48] Tchanders: dbrant: varnent: hi, around? :) [21:00:54] present [21:01:15] (03CR) 10Urbanecm: [C: 03+2] New stream config for mobileapps Places feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988108 (https://phabricator.wikimedia.org/T351165) (owner: 10Sharvaniharan) [21:01:56] (03Merged) 10jenkins-bot: New stream config for mobileapps Places feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988108 (https://phabricator.wikimedia.org/T351165) (owner: 10Sharvaniharan) [21:02:11] dbrant: is it intentional that the `AccountVanishRequests` recipient does not exist? [21:02:51] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:988108|New stream config for mobileapps Places feature (T351165)]] [21:02:56] T351165: Places: Create a stream for the feature - https://phabricator.wikimedia.org/T351165 [21:03:35] dbrant: plus, are you really really sure Contact pages are okay to be used for this purpose? note that mail sent via the contact page are not authenticated, so you can't be sure the email comes from whichever account it claims it does [21:04:12] (if you plan to add that capability, great, let's do it. but if not, i'd like to make sure this aspect of contact pages is known to you and your team) [21:04:14] !log urbanecm@deploy2002 sharvaniharan and urbanecm: Backport for [[gerrit:988108|New stream config for mobileapps Places feature (T351165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:04:30] sharvani__: your patch is at mwdebug if you wanna take a look [21:04:43] testing now.. [21:05:20] Looks good. Thank you. 
[21:05:23] !log urbanecm@deploy2002 sharvaniharan and urbanecm: Continuing with sync [21:05:26] syncing, thanks [21:05:54] urbanecm: thanks - yep, this is all known and accounted for. The present patch is for testing purposes of a workflow that is still being finalized, and may not be the final state. (cc Seddon) [21:06:05] okay, fair enough. just making sure. [21:06:11] want to avoid confusions :) [21:06:36] dbrant: does the "known and accounted for" apply to the fact the target account does not exist as well? [21:07:15] urbanecm: it does now :) [21:07:33] all concerns resolved then :). thanks for your patience. [21:07:45] (03CR) 10Urbanecm: [C: 03+2] Add testwiki config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:08:31] (03Merged) 10jenkins-bot: Add testwiki config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:12:12] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:988108|New stream config for mobileapps Places feature (T351165)]] (duration: 09m 21s) [21:12:20] T351165: Places: Create a stream for the feature - https://phabricator.wikimedia.org/T351165 [21:12:25] Tchanders: varnent: are you around? [21:12:40] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:993718|Add testwiki config to test Contact page for account vanishing. (T343536)]] [21:12:44] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [21:14:06] !log urbanecm@deploy2002 urbanecm and dbrant: Backport for [[gerrit:993718|Add testwiki config to test Contact page for account vanishing. (T343536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:14:14] dbrant: can you test? [21:15:16] urbanecm: works! 
ty [21:15:20] awesome [21:15:21] !log urbanecm@deploy2002 urbanecm and dbrant: Continuing with sync [21:15:22] proceeding [21:15:46] (03PS2) 10BryanDavis: Stop building buster based images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [21:16:07] (03CR) 10BryanDavis: [C: 03+2] Stop building buster based images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [21:16:42] (03Merged) 10jenkins-bot: Stop building buster based images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [21:19:48] Tchanders: varnent: last call, are you around for your deployment? [21:21:50] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:993718|Add testwiki config to test Contact page for account vanishing. (T343536)]] (duration: 09m 10s) [21:21:55] dbrant: deployed [21:22:05] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [21:22:09] and since neither Tchanders or varnent are here, this concludes the window i think. [21:23:49] thx! [21:25:18] Hi I'm here, kid finally asleep. Anyone object if I do my deployments? 
[21:25:39] Tchanders: feel free to go ahead, rest is done (except varnent, who's not here it seems) [21:26:14] urbanecm: Thanks [21:27:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995059 (https://phabricator.wikimedia.org/T353495) (owner: 10Tchanders) [21:28:36] (03Merged) 10jenkins-bot: Set $wgEnablePartialActionBlocks true for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995059 (https://phabricator.wikimedia.org/T353495) (owner: 10Tchanders) [21:28:49] !log tchanders@deploy2002 Started scap: Backport for [[gerrit:995059|Set $wgEnablePartialActionBlocks true for most wikis (T353495)]] [21:28:55] T353495: Deploy partial action blocks on all wikis except the top 6 - https://phabricator.wikimedia.org/T353495 [21:30:13] !log tchanders@deploy2002 tchanders: Backport for [[gerrit:995059|Set $wgEnablePartialActionBlocks true for most wikis (T353495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:38] !log tchanders@deploy2002 tchanders: Continuing with sync [21:38:54] !log tchanders@deploy2002 Finished scap: Backport for [[gerrit:995059|Set $wgEnablePartialActionBlocks true for most wikis (T353495)]] (duration: 10m 04s) [21:39:00] T353495: Deploy partial action blocks on all wikis except the top 6 - https://phabricator.wikimedia.org/T353495 [21:46:28] (03CR) 10Bking: [C: 03+2] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/995041 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:46:50] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10Dzahn) a:05WMDECyn→03None [21:49:19] (03PS1) 10Ahmon Dancy: Temporarily enable Dockerfile frontend on trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/995103 (https://phabricator.wikimedia.org/T356418) [21:52:47] (03PS1) 10Dzahn: admin: add wmdecyn to 
ldap_only_admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/995106 (https://phabricator.wikimedia.org/T355937) [21:54:02] (03PS2) 10Dzahn: admin: add wmdecyn to ldap_only_admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/995106 (https://phabricator.wikimedia.org/T355937) [21:57:23] !log LDAP - added wmdecyn to wmde and nda groups T355937 [21:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:27] T355937: Grant Access to for - https://phabricator.wikimedia.org/T355937 [21:59:21] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-25 15:17:04 +0000 (expires in 83 days) https://wikitech.wikimedia.org/wiki/Search [21:59:39] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-25 15:17:04 +0000 (expires in 83 days) https://wikitech.wikimedia.org/wiki/Search [21:59:41] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-25 15:17:04 +0000 (expires in 83 days) https://wikitech.wikimedia.org/wiki/Search [21:59:47] (03CR) 10Dzahn: "I already added them to the 2 groups after KFrancis confirmed NDA was signed." 
[puppet] - 10https://gerrit.wikimedia.org/r/995106 (https://phabricator.wikimedia.org/T355937) (owner: 10Dzahn) [21:59:55] RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-25 15:17:04 +0000 (expires in 83 days) https://wikitech.wikimedia.org/wiki/Search [22:05:57] RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-25 15:17:04 +0000 (expires in 83 days) https://wikitech.wikimedia.org/wiki/Search [22:08:19] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-25 15:17:04 +0000 (expires in 83 days) https://wikitech.wikimedia.org/wiki/Search [22:10:40] 10SRE, 10Wikimedia-Mailing-lists: Create a mail list for METU Northern Cyprus Campus Wikipedia Society - https://phabricator.wikimedia.org/T352892 (10Dzahn) I got bold here and created this as "metu-wikipedia-society" with the 3 owners listed. ` [lists1001:~] $ sudo mailman-wrapper create -o e259163@metu.edu... [22:11:36] (03PS1) 10Bking: cloudelastic: remove soon-to-be-defunct hostnames from SNI [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) [22:11:41] 10SRE, 10Wikimedia-Mailing-lists: Create a mail list for METU Northern Cyprus Campus Wikipedia Society - https://phabricator.wikimedia.org/T352892 (10Dzahn) 05Open→03Resolved a:03Dzahn All 3 users should have received email about the list creation now. 
[22:12:44] (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:12:44] (VarnishUnavailable) firing: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:15:26] (03PS1) 10Scott French: P:httpbb: migrate tests from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) [22:15:28] (03PS1) 10Scott French: P:httpbb: clean up after move from cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/995109 (https://phabricator.wikimedia.org/T356054) [22:16:45] 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori) In lieu of exporting a route map, MediaWiki could, as a first pass at the problem, emit a response header that signals to the CDN that a request contained garbage parameters. The... 
[22:17:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [22:17:44] (VarnishUnavailable) resolved: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:17:44] (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:20:30] (03Abandoned) 10Dzahn: add DMARC record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/994864 (https://phabricator.wikimedia.org/T355776) (owner: 10Dzahn) [22:25:21] (03PS1) 10Bking: cloudelastic: Add private IP canary back to load balancer pool [puppet] - 10https://gerrit.wikimedia.org/r/995110 (https://phabricator.wikimedia.org/T355617) [22:30:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995110 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [22:33:35] (03CR) 10Scott French: "This is the first in a two-part chain that should move the httpbb tests from cumin1001 to cumin1002."
[puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [22:37:39] (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Add private IP canary back to load balancer pool [puppet] - 10https://gerrit.wikimedia.org/r/995110 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [22:37:56] (03Abandoned) 10Scott French: httpbb: manage test timer presence unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/995060 (owner: 10Scott French) [22:42:54] (03CR) 10Bking: [C: 03+2] cloudelastic: Add private IP canary back to load balancer pool [puppet] - 10https://gerrit.wikimedia.org/r/995110 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [22:51:36] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1010. [22:51:55] !log bking@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=cloudelastic,name=cloudelastic1003.wikimedia.org [22:52:22] !log bking@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=cloudelastic,name=cloudelastic1010.eqiad.wmnet [22:54:14] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1010.eqiad.wmnet [23:22:50] (03CR) 10Cwhite: [C: 03+2] logging::collector: add mw accesslog sampling by benthos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) (owner: 10Cwhite)