[00:00:33] (CR) Dzahn: [C: +2] phorge: git clone arcanist also from we.phorge.it, not Phacility [puppet] - https://gerrit.wikimedia.org/r/887431 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[00:02:53] (CR) Dzahn: [C: +2] "using this arcanist version fixed other problems I was running into during setup :)" [puppet] - https://gerrit.wikimedia.org/r/887431 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[00:04:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:04:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:06:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mw2425']
[00:07:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mw2424']
[00:09:08] (PS1) Dzahn: phorge: mvove httpd setup to profile and don't call it apache [puppet] - https://gerrit.wikimedia.org/r/887432 (https://phabricator.wikimedia.org/T328595)
[00:09:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:09:36] (CR) Andrea Denisse: [C: +2] netmon: Add netmon1003 to the ganeti rapi nodes list [puppet] - https://gerrit.wikimedia.org/r/887409 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[00:09:43] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:12:59] (PS1) Dzahn: phorge: install php-mbstring, php-curl and php-mysql modules [puppet] - https://gerrit.wikimedia.org/r/887433 (https://phabricator.wikimedia.org/T328595)
[00:16:20] (CR) Dzahn: [C: +1] "response from Affcom: "After discussing the issue of creating a Wikimedia wiki for the Azerbaijani UG, AffCom decided to give the green li" [dns] - https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: Dzahn)
[00:16:54] (CR) Dzahn: [C: +1] add az.wikimedia.org for Azerbaijani Wikimedians User Group (1 comment) [dns] - https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: Dzahn)
[00:17:17] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2426']
[00:20:30] (CR) Zabe: [C: +1] add az.wikimedia.org for Azerbaijani Wikimedians User Group [dns] - https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: Dzahn)
[00:22:07] (CR) Dzahn: [C: +1] "thanks Zabe, planning to merge tomorrow unless any concerns are brought up" [dns] - https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: Dzahn)
[00:22:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2427']
[00:26:51] Amir1: For your information, at Spanish Wikipedia we have decided to protect the front page elements with an anti-abuse filter, so we have removed the cascade protection that you implemented as an emergency action.
If you have any questions, we can chat: #wikipedia-es-biblios https://es.wikipedia.org/wiki/Especial:FiltroAntiAbusos/136
[00:26:54] (PS1) Zabe: Add Apache configuration for azwikimedia [puppet] - https://gerrit.wikimedia.org/r/887434 (https://phabricator.wikimedia.org/T306015)
[00:27:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2426']
[00:32:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2428']
[00:32:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2427']
[00:32:59] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2429']
[00:39:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2428']
[00:39:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2429']
[00:43:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2430']
[00:43:49] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2431']
[00:47:24] SRE, ops-codfw, SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (Papaul) a:Jhancock.wm
[00:47:34] SRE, ops-codfw, SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (Papaul) p:Triage→Medium
[00:49:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2430']
[00:50:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2431']
[00:52:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade
firmware for hosts ['mw2432']
[00:52:09] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2433']
[00:58:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2432']
[01:00:19] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2434']
[01:00:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2433']
[01:00:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2435']
[01:07:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2434']
[01:07:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mw2435']
[01:08:48] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[01:37:34] (PS1) Papaul: Add new mw node to site.pp [puppet] - https://gerrit.wikimedia.org/r/887438 (https://phabricator.wikimedia.org/T326362)
[01:38:36] (CR) Papaul: [C: +2] Add new mw node to site.pp [puppet] - https://gerrit.wikimedia.org/r/887438 (https://phabricator.wikimedia.org/T326362) (owner: Papaul)
[01:43:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2420.codfw.wmnet with OS buster
[01:43:23] SRE, ops-codfw, DC-Ops, serviceops, Patch-For-Review: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster
[01:51:19] (PS1) Andrew Bogott: cinder-volume.conf.erb: remove an erb typo [puppet] - https://gerrit.wikimedia.org/r/887439
[01:51:43] (CR) CI reject: [V: -1]
cinder-volume.conf.erb: remove an erb typo [puppet] - https://gerrit.wikimedia.org/r/887439 (owner: Andrew Bogott)
[01:59:09] (PS2) Andrew Bogott: cinder-volume.conf.erb: remove an erb typo [puppet] - https://gerrit.wikimedia.org/r/887439 (https://phabricator.wikimedia.org/T324729)
[02:00:31] (CR) Andrew Bogott: [C: +2] cinder-volume.conf.erb: remove an erb typo [puppet] - https://gerrit.wikimedia.org/r/887439 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[02:10:46] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:46] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2420.codfw.wmnet with OS buster
[02:58:42] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster executed with errors: - mw2420 (**FAIL**) - Remove...
[02:58:48] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @MoritzMuehlenhoff I am trying to get Buster on those PE R450 it looks like we are missing some drivers. (PERC H745 Controller,) Thanks...
[03:00:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:50] SRE, API Platform, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (Tgr) One thing to keep in m...
[03:52:31] (PS1) Andrew Bogott: cinder::volume: include 'tgt' package on hosts [puppet] - https://gerrit.wikimedia.org/r/887441 (https://phabricator.wikimedia.org/T324729)
[03:57:57] (CR) Andrew Bogott: [C: +2] cinder::volume: include 'tgt' package on hosts [puppet] - https://gerrit.wikimedia.org/r/887441 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[04:35:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install6002), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:55:51] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Jhancock.wm)
[05:36:39] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:26:41] (CR) Giuseppe Lavagetto: [C: -2] "This would be highly disruptive both during debugging things in production and when building e.g.
docker images and when running CI in gen" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [06:28:35] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Marostegui) 05Open→03Resolved Thank you John - the RAID is back to optimal [06:30:19] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887357 [06:30:27] (03CR) 10CI reject: [V: 04-1] Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887357 (owner: 10Marostegui) [06:31:47] (03Abandoned) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887357 (owner: 10Marostegui) [06:33:24] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2011 back to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887445 [06:35:39] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2011 back to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887445 (owner: 10Marostegui) [06:36:18] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2011 back to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887445 (owner: 10Marostegui) [06:36:55] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:887445|ProductionServices.php: Promote pc2011 back to pc1 master]] [06:37:25] (03PS1) 10Marostegui: pc2011: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887446 [06:37:49] (03CR) 10Marostegui: [C: 03+2] pc2011: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887446 (owner: 10Marostegui) [06:38:49] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:887445|ProductionServices.php: Promote pc2011 back to pc1 master]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [06:39:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:39:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:39:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [06:40:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [06:40:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:40:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:40:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T328817)', diff saved to https://phabricator.wikimedia.org/P43769 and previous config saved to /var/cache/conftool/dbconfig/20230208-064027-marostegui.json [06:40:31] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:41:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T328817)', diff saved to https://phabricator.wikimedia.org/P43770 and previous config saved to /var/cache/conftool/dbconfig/20230208-064134-marostegui.json [06:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43771 and previous config saved to /var/cache/conftool/dbconfig/20230208-064405-root.json [06:45:53] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Joe) >>! In T300977#7899855, @jbond wrote: >>>! 
In T300977#7836272, @Volans wrote: >> If I may add my use case too, I woul... [06:47:32] (03PS1) 10Marostegui: cuc_user_cuc_user_text_T328817.py: All databases [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887621 (https://phabricator.wikimedia.org/T328817) [06:48:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:48:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:48:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [06:48:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [06:51:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:51:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:51:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T328817)', diff saved to https://phabricator.wikimedia.org/P43772 and previous config saved to /var/cache/conftool/dbconfig/20230208-065149-marostegui.json [06:51:53] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:52:56] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:887445|ProductionServices.php: Promote pc2011 back to pc1 master]] (duration: 16m 01s) [06:52:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T328817)', diff saved to https://phabricator.wikimedia.org/P43773 and previous config saved to /var/cache/conftool/dbconfig/20230208-065257-marostegui.json [06:53:05] (03CR) 10Marostegui: [C: 03+2] cuc_user_cuc_user_text_T328817.py: All databases 
[software/schema-changes] - 10https://gerrit.wikimedia.org/r/887621 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T0700) [07:07:32] !log Install 10.6.12 on pc2014 T329011 [07:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:35] T329011: Compile and package MariaDB 10.4.28 and 10.6.12 - https://phabricator.wikimedia.org/T329011 [07:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P43774 and previous config saved to /var/cache/conftool/dbconfig/20230208-070803-marostegui.json [07:16:37] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10Marostegui) Option #5 sounds good. We'd need to do a switchover though for that master whenever we reach the row A eqiad switch... 
[07:18:24] !log dbmaint deploy schema change on s8 eqiad (with replication) T328807 T328828 [07:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:28] T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828 [07:18:28] T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807 [07:18:46] !log dbmaint deploy schema change on s4 eqiad (with replication) T328807 T328828 [07:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:15] !log dbmaint deploy schema change on s5 eqiad (with replication) T328807 T328828 [07:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P43775 and previous config saved to /var/cache/conftool/dbconfig/20230208-072310-marostegui.json [07:38:00] !log dbmaint deploy schema change on s2 eqiad (with replication) T328807 T328828 [07:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:05] T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828 [07:38:05] T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807 [07:38:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T328817)', diff saved to https://phabricator.wikimedia.org/P43776 and previous config saved to /var/cache/conftool/dbconfig/20230208-073816-marostegui.json [07:38:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:38:20] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 
[07:38:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:38:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43777 and previous config saved to /var/cache/conftool/dbconfig/20230208-073837-marostegui.json [07:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43778 and previous config saved to /var/cache/conftool/dbconfig/20230208-074357-marostegui.json [07:44:01] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:56:31] !log dbmaint deploy schema change on s7 eqiad (with replication) T328807 T328828 [07:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:36] T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828 [07:56:36] T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807 [07:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P43779 and previous config saved to /var/cache/conftool/dbconfig/20230208-075903-marostegui.json [07:59:36] !log dbmaint deploy schema change on s1 eqiad (with replication) T328807 T328828 [07:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:01:58] !log dbmaint deploy schema change on s3 eqiad (with replication) T328807 T328828 [08:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:02] T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828 [08:02:03] T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807 [08:04:46] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) The host is now up (with less memory). I am going to start mariadb to let it catch up. Please let me know before shutting down the host to replace the DIMM so I can stop mariadb again. [08:10:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10ayounsi) [08:13:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10ayounsi) Thanks to o11y help, the dashboard is now much more usable. Most of the traffic dropped in iptables are RST packets, so it's now more than sporadic, see... [08:14:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P43780 and previous config saved to /var/cache/conftool/dbconfig/20230208-081410-marostegui.json [08:19:22] (03PS1) 10Muehlenhoff: Extend access for mhoutti [puppet] - 10https://gerrit.wikimedia.org/r/887722 [08:20:18] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10ayounsi) [08:20:31] 10SRE, 10serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10ayounsi) 05Resolved→03Open Reopening this task as the issue is still happening. 
Thanks to o11y the dashboard has been refreshed and have more informations (TCP flags, source/dest hostnames).... [08:26:53] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for mhoutti [puppet] - 10https://gerrit.wikimedia.org/r/887722 (owner: 10Muehlenhoff) [08:28:51] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [08:29:07] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) We'll depool eqiad I would assume? cc @Joe @akosiaris We'd still need to switchover m1 master (we do have m1 databases but I guess we are not switchin... [08:29:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43781 and previous config saved to /var/cache/conftool/dbconfig/20230208-082916-marostegui.json [08:29:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:29:21] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:29:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43782 and previous config saved to /var/cache/conftool/dbconfig/20230208-082938-marostegui.json [08:29:44] (03PS1) 10Marostegui: m2-master: Switchover to dbproxy1015 [dns] - 10https://gerrit.wikimedia.org/r/887723 (https://phabricator.wikimedia.org/T329073) [08:36:31] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MoritzMuehlenhoff) 
[08:38:45] SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MoritzMuehlenhoff)
[08:41:29] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[08:43:50] (PS1) Marostegui: mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887724 (https://phabricator.wikimedia.org/T329141)
[08:45:23] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/887724 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui)
[08:46:42] (PS1) Marostegui: db1159: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/887725 (https://phabricator.wikimedia.org/T329141)
[08:47:06] (CR) Marostegui: [C: +2] db1159: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/887725 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui)
[08:53:50] (PS2) Marostegui: mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887724 (https://phabricator.wikimedia.org/T329141)
[08:53:59] (CR) CI reject: [V: -1] mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887724 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui)
[08:54:17] (Abandoned) Marostegui: mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887724 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui)
[08:54:48] !log installing imagemagick security updates
[08:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:55:48] (PS1) Marostegui: db1164: Set to critical [puppet] - https://gerrit.wikimedia.org/r/887726
[08:56:14] (CR) Marostegui: [C: +2] db1164: Set to critical [puppet] - https://gerrit.wikimedia.org/r/887726 (owner: Marostegui)
[08:59:07] (PS1) Marostegui: mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141)
[08:59:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:01:09] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui)
[09:01:51] (CR) Volans: "Nice adition! I've left some suggestions inline."
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:03:58] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [09:04:38] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [09:07:31] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: don't page for thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/887325 (owner: 10Filippo Giunchedi) [09:08:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "Modulo the discussion re: pages/alerts for majority up" [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron) [09:14:06] !log purge user_auth table on grafana1002 - T328784 [09:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:11] T328784: Grafana LDAP sync fails post upgrade - https://phabricator.wikimedia.org/T328784 [09:19:13] (03CR) 10Jcrespo: [C: 03+1] "+1, same heads up as previously, db1159 has currently MIXED row format" [puppet] - 10https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141) (owner: 10Marostegui) [09:21:19] (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1159 to m3 mater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141) (owner: 10Marostegui) [09:29:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43783 and previous config saved to /var/cache/conftool/dbconfig/20230208-092954-marostegui.json [09:29:58] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:31:51] (03PS1) 10Hashar: Revert "contint: remove obsolete firewall rules from labs" [puppet] - 
10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) [09:32:01] (03CR) 10CI reject: [V: 04-1] Revert "contint: remove obsolete firewall rules from labs" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [09:32:31] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Shared-Data-Infrastructure, 10Traffic: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) a:05BTullis→03nfraison [09:41:59] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10akosiaris) eqiad will still be depooled for this one. The current timeline for repooling eqiad in on March 8th, 1 day after the proposed timeline on this task. [09:44:53] (03CR) 10Filippo Giunchedi: "I tried building locally with the following and got an error:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [09:45:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P43785 and previous config saved to /var/cache/conftool/dbconfig/20230208-094500-marostegui.json [09:45:42] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [09:48:58] (03PS2) 10Hashar: Revert "contint: remove obsolete firewall rules from labs" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) [09:49:24] (03PS1) 10Elukey: ores: add per-model metrics and fix label for response codes [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) [09:50:25] (03CR) 10Elukey: "Adding also Filippo to get his insights from the Observability point 
of view (if the change is feasible or not)." [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [09:50:59] (03PS1) 10Marostegui: instances.yaml: Remove db1096 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/887733 (https://phabricator.wikimedia.org/T329147) [09:51:34] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1096 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/887733 (https://phabricator.wikimedia.org/T329147) (owner: 10Marostegui) [09:52:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1096 (s5,s6) from dbctl T329147', diff saved to https://phabricator.wikimedia.org/P43786 and previous config saved to /var/cache/conftool/dbconfig/20230208-095207-marostegui.json [09:52:10] T329147: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 [09:52:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10fgiunchedi) [09:58:26] (03PS1) 10Marostegui: db1096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887734 (https://phabricator.wikimedia.org/T329147) [09:58:52] (03CR) 10Marostegui: [C: 03+2] db1096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887734 (https://phabricator.wikimedia.org/T329147) (owner: 10Marostegui) [09:59:28] (03PS1) 10Btullis: Add some dummy tokens for the airflow_test database [labs/private] - 10https://gerrit.wikimedia.org/r/887735 (https://phabricator.wikimedia.org/T315580) [09:59:58] !log installing openssl security updates on bullseye [10:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:02] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add some dummy tokens for the airflow_test database [labs/private] - 10https://gerrit.wikimedia.org/r/887735 (https://phabricator.wikimedia.org/T315580) (owner: 10Btullis) [10:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P43787 and previous config saved to /var/cache/conftool/dbconfig/20230208-100006-marostegui.json [10:00:59] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [10:01:38] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39459/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [10:02:00] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [10:04:31] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) [10:07:04] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 with invalid version [10:07:10] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 with invalid version [10:08:06] !log phedenskog@deploy1002 Started deploy [performance/navtiming@079891a]: (no justification provided) [10:08:14] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@079891a]: (no justification provided) (duration: 00m 08s) [10:12:54] (03PS1) 10Zabe: Revert "slwiki: Raise AF emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887747 [10:13:07] (03PS2) 10Zabe: Revert "slwiki: Raise AF emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887747 (https://phabricator.wikimedia.org/T328366) [10:13:18] (03PS3) 10Zabe: Revert "slwiki: Raise AF 
emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887747 (https://phabricator.wikimedia.org/T328366) [10:13:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10cmooney) One observation from the dashboard is that the RST's aren't very "sporadic" (as per title of this task). They seem fairly evenly distributed over time a... [10:15:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43788 and previous config saved to /var/cache/conftool/dbconfig/20230208-101512-marostegui.json [10:15:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:15:16] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:15:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:15:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43789 and previous config saved to /var/cache/conftool/dbconfig/20230208-101534-marostegui.json [10:18:23] jouncebot, nowandnext [10:18:23] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [10:18:23] In 0 hour(s) and 41 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1100) [10:18:28] (03CR) 10Zabe: [C: 03+2] Revert "slwiki: Raise AF emergency disable treshold+count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887747 (https://phabricator.wikimedia.org/T328366) (owner: 10Zabe) [10:19:12] (03Merged) 10jenkins-bot: Revert "slwiki: Raise AF emergency disable treshold+count" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/887747 (https://phabricator.wikimedia.org/T328366) (owner: 10Zabe) [10:19:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43790 and previous config saved to /var/cache/conftool/dbconfig/20230208-101948-marostegui.json [10:19:52] !log zabe@deploy1002 Started scap: Backport for [[gerrit:887747|Revert "slwiki: Raise AF emergency disable treshold+count" (T328366)]] [10:20:01] zabe: https://phabricator.wikimedia.org/T329151 should I stop the schema change? [10:20:31] Oh actually I am not dropping that one but cul_reason_id [10:21:41] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10Vgutierrez) a:03Vgutierrez 2.6.6 has been running as expected since the experiment started, next week we plan to upgrade the whole CDN [10:21:45] !log zabe@deploy1002 zabe: Backport for [[gerrit:887747|Revert "slwiki: Raise AF emergency disable treshold+count" (T328366)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:22:18] marostegui, it's T328807 and I would have said yes, but it seems to already be done (s3 is mostly irrelevant for cu) [10:22:18] T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807 [10:22:24] (03CR) 10Ladsgroup: [C: 03+1] m2-master: Switchover to dbproxy1015 [dns] - 10https://gerrit.wikimedia.org/r/887723 (https://phabricator.wikimedia.org/T329073) (owner: 10Marostegui) [10:22:41] (03CR) 10Marostegui: [C: 03+2] m2-master: Switchover to dbproxy1015 [dns] - 10https://gerrit.wikimedia.org/r/887723 (https://phabricator.wikimedia.org/T329073) (owner: 10Marostegui) [10:23:06] so I don't think you can do anything and we need to quickly patch that [10:23:13] zabe: roger [10:24:12] (03CR) 10Jelto: [C: 03+2] install_server: add custom partman config for gitlab-runner 
[puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) (owner: 10Jelto) [10:24:42] hi zabe, I'm around if you need any help with fixing that. [10:26:03] !log Failover m2-master from dbproxy1013 to dbproxy1015 T329073 [10:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:06] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [10:26:56] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [10:28:18] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39460/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [10:28:41] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:887747|Revert "slwiki: Raise AF emergency disable treshold+count" (T328366)]] (duration: 08m 49s) [10:28:48] (03CR) 10Volans: "From local testing I see only the curator deps to be adjusted, all the rest doesn't seem necessary to me from my local test on 3.9 and 3.1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) (owner: 10Jbond) [10:30:30] (03PS1) 10Zabe: Remove cul_reason comment table migration code [extensions/CheckUser] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887748 (https://phabricator.wikimedia.org/T233004) [10:30:35] (03PS2) 10Zabe: Remove cul_reason comment table migration code [extensions/CheckUser] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887748 (https://phabricator.wikimedia.org/T233004) [10:31:14] (03CR) 10Klausman: [C: 03+1] ores: add per-model metrics and fix label for response codes [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [10:31:43] (03PS1) 10Hashar: contint: factor 
common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) [10:32:01] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [10:32:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [10:32:09] (03PS3) 10Zabe: Remove cul_reason comment table migration code [extensions/CheckUser] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887748 (https://phabricator.wikimedia.org/T233004) [10:32:30] (03PS19) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [10:32:40] urbanecm, I think this is already fixed in wmf.22, backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/887748 should be enough [10:33:03] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39461/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [10:33:08] (03CR) 10Elukey: "Thanks a lot for the comments! I tried to fix the most trivial ones in this round of changes, I'll work on the rest in a bit." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:33:13] ack [10:33:31] (03CR) 10CI reject: [V: 04-1] contint: factor common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [10:33:34] (03CR) 10Zabe: [C: 03+2] Remove cul_reason comment table migration code [extensions/CheckUser] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887748 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:33:49] !log deploying python3-wmflib_1.2.1 to the fleet [10:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P43791 and previous config saved to /var/cache/conftool/dbconfig/20230208-103455-marostegui.json [10:35:33] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [10:37:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887748 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:37:54] (03PS1) 10Volans: cumin: add alias for OS=bookworm [puppet] - 10https://gerrit.wikimedia.org/r/887739 [10:38:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [10:40:11] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) I've extraced the sqlite database files for this container and had a look. To find the scheme one can look at [[ https://github.com/openstack/swift/blob/master/swift/containe... 
[10:41:40] (03PS3) 10Filippo Giunchedi: opensearch: reverse-proxy access to opensearch API [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) [10:41:56] (03CR) 10Filippo Giunchedi: opensearch: reverse-proxy access to opensearch API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [10:43:57] (03PS1) 10Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 [10:43:59] (03PS1) 10Giuseppe Lavagetto: Add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 [10:49:22] (03Merged) 10jenkins-bot: Remove cul_reason comment table migration code [extensions/CheckUser] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887748 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [10:49:46] !log zabe@deploy1002 Started scap: Backport for [[gerrit:887748|Remove cul_reason comment table migration code (T233004 T329151)]] [10:49:51] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [10:49:52] T329151: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cul_reason' in 'field list'Function: IndexPager::buildQueryInfo (MediaWiki\CheckUser\CheckUser\Pagers\CheckUserLogPager)Query: SELECT cul_id,cul_timestamp,cul_reason, - https://phabricator.wikimedia.org/T329151 [10:50:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P43793 and previous config saved to /var/cache/conftool/dbconfig/20230208-105001-marostegui.json [10:50:29] (03PS1) 10Elukey: cumin: add more ealiases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 [10:50:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/887739 (owner: 10Volans) [10:50:59] 
(KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:51:27] (03CR) 10EoghanGaffney: [C: 03+2] Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [10:51:36] !log zabe@deploy1002 zabe: Backport for [[gerrit:887748|Remove cul_reason comment table migration code (T233004 T329151)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:53:17] (03PS2) 10Elukey: cumin: add more ealiases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 [10:54:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey) [10:54:57] (03CR) 10Volans: [C: 03+2] cumin: add alias for OS=bookworm [puppet] - 10https://gerrit.wikimedia.org/r/887739 (owner: 10Volans) [10:55:02] (03PS2) 10Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 [10:55:04] (03PS2) 10Giuseppe Lavagetto: Add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 [10:55:39] (03CR) 10Volans: cumin: add more ealiases for the ml-staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey) [10:55:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:02] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 
10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) @Joe thanks for the input >>! In T300977#8596499, @Joe wrote: > > This would break a lot of workflows, I t would... [10:57:47] (03PS3) 10Elukey: cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 [10:57:50] (03CR) 10Elukey: cumin: add more aliases for the ml-staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey) [10:57:52] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:887748|Remove cul_reason comment table migration code (T233004 T329151)]] (duration: 08m 05s) [10:57:56] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [10:57:56] T329151: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cul_reason' in 'field list'Function: IndexPager::buildQueryInfo (MediaWiki\CheckUser\CheckUser\Pagers\CheckUserLogPager)Query: SELECT cul_id,cul_timestamp,cul_reason, - https://phabricator.wikimedia.org/T329151 [10:59:18] (03CR) 10Volans: [C: 03+1] "LGTM, see note inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [10:59:25] Special:CheckUserLog at enwiki is working again [11:00:03] (03CR) 10Hnowlan: "lgtm, one query" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1100) [11:00:25] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10jbond) sgmt just ping if/when you need more pointers [11:05:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T328817)', diff saved to https://phabricator.wikimedia.org/P43796 and previous config saved to 
/var/cache/conftool/dbconfig/20230208-110507-marostegui.json [11:05:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:05:12] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:05:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:10:36] (03PS1) 10Filippo Giunchedi: opensearch_dashboards: enforce memory limit [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) [11:13:12] !log Stop mysql on db1096 (s5,s6) T329147 [11:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:15] T329147: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 [11:13:40] (03CR) 10Volans: [C: 03+1] "Thanks for this addition. A possible alternative approach inline, not a blocker." [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [11:13:56] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:14:14] (03CR) 10Filippo Giunchedi: "It is clear there's a leak, so limit memory at the systemd unit level. 
I have selected 350M as that should give >20d uptime based on the c" [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [11:14:21] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) > I would maintain that it's more urgent to provide an artifact repository for having local npm/pypi/go packages... [11:18:31] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [11:22:10] (03PS20) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:24:07] (03CR) 10Filippo Giunchedi: [C: 03+1] ores: add per-model metrics and fix label for response codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [11:24:18] (03CR) 10Filippo Giunchedi: [C: 03+1] ores: add per-model metrics and fix label for response codes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [11:25:35] (03PS21) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:27:02] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 
10Elukey) [11:30:42] (03CR) 10Elukey: [C: 03+2] ores: add per-model metrics and fix label for response codes [puppet] - 10https://gerrit.wikimedia.org/r/887732 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [11:30:45] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Arclamp [puppet] - 10https://gerrit.wikimedia.org/r/887769 (https://phabricator.wikimedia.org/T135991) [11:30:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:32:53] jouncebot: nowandnext [11:32:53] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1100) [11:32:53] In 2 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1400) [11:37:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:38:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:38:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:38:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T328817)', diff saved to https://phabricator.wikimedia.org/P43797 and previous config saved to /var/cache/conftool/dbconfig/20230208-113832-marostegui.json 
[11:38:36] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:40:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T328817)', diff saved to https://phabricator.wikimedia.org/P43798 and previous config saved to /var/cache/conftool/dbconfig/20230208-114040-marostegui.json [11:45:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/887769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:46:56] 10Puppet, 10Infrastructure-Foundations: systemd-timer puppet code triggers an execution when applying a schedule change - https://phabricator.wikimedia.org/T329158 (10jcrespo) [11:52:12] 10Puppet, 10Infrastructure-Foundations: systemd-timer puppet code triggers an execution when applying a schedule change - https://phabricator.wikimedia.org/T329158 (10jcrespo) [11:52:48] (03PS1) 10Muehlenhoff: arclamp: Remove rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/887771 (https://phabricator.wikimedia.org/T316223) [11:53:29] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: flowspec1001.eqiad.wmnet [11:53:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: flowspec1001.eqiad.wmnet [11:55:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P43799 and previous config saved to /var/cache/conftool/dbconfig/20230208-115546-marostegui.json [11:56:36] (03PS16) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [11:56:38] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: moss-be1001.eqiad.wmnet [11:56:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: moss-be1001.eqiad.wmnet 
[11:57:38] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on an-worker1096.eqiad.wmnet with reason: Attempting to move some GPUs [11:57:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-worker1096.eqiad.wmnet with reason: Attempting to move some GPUs [11:57:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ad975722-2d29-4e76-b155-59e38bc020f3) set by... [11:57:59] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on an-worker1097.eqiad.wmnet with reason: Attempting to move some GPUs [11:58:01] (03CR) 10Volans: Add sre.k8s.upgrade-cluster (0312 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:58:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-worker1097.eqiad.wmnet with reason: Attempting to move some GPUs [11:58:34] (03CR) 10Slyngshede: [C: 03+2] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [11:58:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b637e9a2-c8cd-43d2-ab57-1acb06e6d236) set by... 
[11:59:14] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: Attempting to move some GPUs [11:59:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: Attempting to move some GPUs [11:59:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a56d2950-9f5c-4f2c-8fc1-ddb7900637da) set by... [12:00:07] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887751 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [12:00:28] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:38] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bullseye [12:04:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) [12:05:48] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:37] (03CR) 10Clément Goubert: "This change is ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert) [12:09:58] (KubernetesCalicoDown) firing: dse-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:10:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10MoritzMuehlenhoff) >>! In T326362#8596391, @Papaul wrote: > @MoritzMuehlenhoff I am trying to get Buster on those PE R450 it looks like we are missing some drivers. (PERC H745 Cont... [12:10:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P43800 and previous config saved to /var/cache/conftool/dbconfig/20230208-121053-marostegui.json [12:13:08] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kafka-stretch2002.codfw.wmnet [12:13:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kafka-stretch2002.codfw.wmnet [12:15:35] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage [12:16:00] (03CR) 10Volans: [C: 04-1] "Small bug inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert) [12:18:01] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage [12:18:10] (03PS1) 10Hokwelum: use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - 10https://gerrit.wikimedia.org/r/887776 [12:18:32] (03CR) 10CI reject: [V: 04-1] use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - 10https://gerrit.wikimedia.org/r/887776 (owner: 
10Hokwelum) [12:19:27] (03PS3) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 [12:19:40] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas [12:19:53] (03CR) 10Clément Goubert: sre.discovery.datacenter: Add progress logging (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert) [12:20:06] (03PS2) 10Hokwelum: use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - 10https://gerrit.wikimedia.org/r/887776 (https://phabricator.wikimedia.org/T328804) [12:21:00] (03CR) 10Volans: [C: 03+1] "LGTM, I'll leave it to your team to chime in on the wording/verbosity" [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert) [12:21:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas [12:24:12] (03PS4) 10Elukey: cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 [12:25:31] (03PS5) 10Elukey: cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 [12:25:46] (03CR) 10Superpes15: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/887752 (https://phabricator.wikimedia.org/T329168) (owner: 10Superpes15) [12:26:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T328817)', diff saved to https://phabricator.wikimedia.org/P43801 and previous config saved to /var/cache/conftool/dbconfig/20230208-122559-marostegui.json [12:26:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:26:03] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:26:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T328817)', diff saved to https://phabricator.wikimedia.org/P43802 and previous config saved to /var/cache/conftool/dbconfig/20230208-122620-marostegui.json [12:28:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T328817)', diff saved to https://phabricator.wikimedia.org/P43803 and previous config saved to /var/cache/conftool/dbconfig/20230208-122829-marostegui.json [12:29:12] (03PS1) 10Marostegui: mariadb: Decommission db1096 [puppet] - 10https://gerrit.wikimedia.org/r/887778 (https://phabricator.wikimedia.org/T329147) [12:29:45] (03PS3) 10Jbond: tox.ini: pin elasticsearch-curator >=5.0.0,<6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) [12:29:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1096.eqiad.wmnet [12:31:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1096 [puppet] - 10https://gerrit.wikimedia.org/r/887778 (https://phabricator.wikimedia.org/T329147) (owner: 10Marostegui) [12:31:25] (03CR) 10Jbond: "updated" [software/spicerack] - 
10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) (owner: 10Jbond) [12:31:54] (03PS4) 10Jbond: tox.ini: pin elasticsearch-curator ~=5.0 * pin elasticsearch-curator ~=5.0 as newer versions cause an error, see T328775 [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) [12:34:01] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [12:34:47] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) (owner: 10Jbond) [12:36:12] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1096.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [12:37:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1096.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [12:37:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:37:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1096.eqiad.wmnet [12:37:46] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 (10Marostegui) [12:38:07] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 (10Marostegui) a:05Marostegui→03None This is ready for DC-Ops [12:38:21] 10ops-eqiad, 10decommission-hardware: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 (10Marostegui) [12:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P43804 and previous config saved to 
/var/cache/conftool/dbconfig/20230208-124335-marostegui.json [12:45:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for Arclamp [puppet] - 10https://gerrit.wikimedia.org/r/887769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:45:40] (03CR) 10Filippo Giunchedi: [C: 03+1] arclamp: Remove rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/887771 (https://phabricator.wikimedia.org/T316223) (owner: 10Muehlenhoff) [12:47:53] (03CR) 10Jbond: P:environment: roll out no proxy config to all hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [12:49:26] (03PS1) 10Nicolas Fraison: fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887780 [12:50:01] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10taavi) [12:50:08] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10taavi) >>! In T133389#2230609, @BBlack wrote: > About constraints, rationales, and paths forward (some of thi... 
[12:51:24] (03PS2) 10Nicolas Fraison: fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887780 (https://phabricator.wikimedia.org/T324522) [12:58:12] (03CR) 10Clément Goubert: [C: 04-1] "Two bugs inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [12:58:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P43805 and previous config saved to /var/cache/conftool/dbconfig/20230208-125841-marostegui.json [12:59:20] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13150 [13:00:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13150 [13:02:55] (03PS1) 10Majavah: toolserver_legacy: Add image credits [puppet] - 10https://gerrit.wikimedia.org/r/887781 (https://phabricator.wikimedia.org/T103965) [13:03:07] (03CR) 10Hashar: [C: 03+1] "I did not need any puppet catalog compiler, this change solely affects WMCS instances and I already cherry picked the change on both deplo" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [13:03:20] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bullseye [13:03:53] (03CR) 10Hashar: [C: 03+1] "There is some duplicate configuration with profile::ci::firewall but I am refactoring that in the follow up change https://gerrit.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [13:06:44] (03PS2) 10Hashar: contint: factor common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) [13:07:03] (03CR) 10Hashar: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [13:13:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T328817)', diff saved to https://phabricator.wikimedia.org/P43807 and previous config saved to /var/cache/conftool/dbconfig/20230208-131348-marostegui.json [13:13:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:13:52] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:14:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T328817)', diff saved to https://phabricator.wikimedia.org/P43808 and previous config saved to /var/cache/conftool/dbconfig/20230208-131409-marostegui.json [13:15:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:20:16] (03CR) 10Jbond: [C: 03+2] tox.ini: pin elasticsearch-curator ~=5.0 * pin elasticsearch-curator ~=5.0 as newer versions cause an error, see T328775 [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) (owner: 10Jbond) [13:20:34] (03PS24) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:20:46] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [13:20:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:51] (03CR) 10Hashar: "From https://puppet-compiler.wmflabs.org/output/887738/1603/:" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [13:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T328817)', diff saved to https://phabricator.wikimedia.org/P43809 and previous config saved to /var/cache/conftool/dbconfig/20230208-132318-marostegui.json [13:23:22] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:24:09] (03CR) 10Jbond: [C: 03+2] wmnet: swap esams and esqin for the puppet CNAME [dns] - 10https://gerrit.wikimedia.org/r/887314 (owner: 10Jbond) [13:24:13] (03PS2) 10Jbond: wmnet: swap esams and esqin for the puppet CNAME [dns] - 10https://gerrit.wikimedia.org/r/887314 [13:27:28] (03PS2) 10Muehlenhoff: arclamp: Remove rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/887771 (https://phabricator.wikimedia.org/T316223) [13:28:28] (03PS1) 10Nicolas Fraison: chore(varnishkafa): add site to VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887784 [13:29:35] (03CR) 10CI reject: [V: 04-1] chore(varnishkafa): add site to VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887784 (owner: 10Nicolas Fraison) [13:29:58] (03CR) 10Muehlenhoff: [C: 03+2] arclamp: Remove rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/887771 (https://phabricator.wikimedia.org/T316223) (owner: 10Muehlenhoff) [13:33:26] !log send puppet.esams.wmnet to eqiad and puppet.esams.wmnet to codfw [13:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:49] !log (correction) send puppet.esams.wmnet to eqiad and puppet.esqin.wmnet to codfw [13:33:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P43810 and previous config saved to /var/cache/conftool/dbconfig/20230208-133825-marostegui.json [13:43:19] (03CR) 10David Caro: "Got a question" [puppet] - 10https://gerrit.wikimedia.org/r/887372 (https://phabricator.wikimedia.org/T329041) (owner: 10Arturo Borrero Gonzalez) [13:43:26] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10EChetty) [13:44:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10MoritzMuehlenhoff) >>! In T159412#8595266, @Dzahn wrote: > As far as I can tell nowadays there is no more node... 
[13:44:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:45:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:45:34] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: consolidate extra floating IP routes [puppet] - 10https://gerrit.wikimedia.org/r/887372 (https://phabricator.wikimedia.org/T329041) [13:45:36] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) [13:49:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:49:35] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) [13:49:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:49:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T328817)', diff saved to https://phabricator.wikimedia.org/P43811 and previous config saved to /var/cache/conftool/dbconfig/20230208-134950-marostegui.json [13:49:54] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:49:56] (03CR) 10CI reject: [V: 04-1] cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [13:51:36] 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10fgiunchedi) Agreed the localhost link isn't user-friendly, and indeed the underlying reason is that the Debian package for AM 
doesn't ship the UI. What we... [13:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T328817)', diff saved to https://phabricator.wikimedia.org/P43812 and previous config saved to /var/cache/conftool/dbconfig/20230208-135216-marostegui.json [13:52:51] (03CR) 10David Caro: [C: 03+1] "LGTM, I'm not very familiar with the content of the change, so mostly acking the puppet code does what seems intended." [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [13:53:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P43813 and previous config saved to /var/cache/conftool/dbconfig/20230208-135331-marostegui.json [13:54:09] (03CR) 10David Caro: cloudgw: introduce additional route for VIPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [13:57:26] (03PS2) 10Nicolas Fraison: chore(varnishkafa): add site to VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887784 [13:58:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (10Volans) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1400) [14:00:05] James_F and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] Heya. [14:00:23] I can’t deploy, sorry [14:00:31] I can do it. [14:00:35] Hi! 
I can deploy in about 5 minutes [14:00:41] or you can James_F :) [14:00:47] James_F: I suspected as much :) [14:00:53] at least self-service your patch [14:00:58] idk if you want to do the other one(s) ^^ [14:01:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882705 (owner: 10Jforrester) [14:01:40] I can do them all, probably. [14:01:53] (03PS2) 10Jforrester: Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882705 [14:02:01] (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882705 (owner: 10Jforrester) [14:02:23] It'd be great if scap backport did the trivial rebase itself. [14:02:27] Ah well. [14:02:33] 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10Volans) Why not linking them back to karma? For example this seems to work: ` https://alerts.wikimedia.org/?q=%40silence_id%3Daa2048cb-3cb7-4046-bcd2-5b50... [14:02:48] (03Merged) 10jenkins-bot: Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882705 (owner: 10Jforrester) [14:03:13] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:882705|Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part I]] [14:03:41] Hi :) I'm here! Let me know when you are ready to deploy my patches :) [14:04:16] Superpes: Hopefully soon! 
[14:04:46] Yep yep no rush :D Just ping me when you're ready :) [14:05:01] (03PS2) 10Jforrester: Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882706 [14:05:04] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:882705|Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part I]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P43815 and previous config saved to /var/cache/conftool/dbconfig/20230208-140722-marostegui.json [14:08:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T328817)', diff saved to https://phabricator.wikimedia.org/P43816 and previous config saved to /var/cache/conftool/dbconfig/20230208-140837-marostegui.json [14:08:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:08:41] (03CR) 10Muehlenhoff: [C: 04-1] "Please don't add any headers without the Rake job, this will only lead to inconsistencies." 
[puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn) [14:08:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:08:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:08:53] (03PS6) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) [14:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T328817)', diff saved to https://phabricator.wikimedia.org/P43817 and previous config saved to /var/cache/conftool/dbconfig/20230208-140859-marostegui.json [14:09:04] 10SRE, 10SRE-Access-Requests: Request for SSH Access for kofori - https://phabricator.wikimedia.org/T328787 (10KOfori) Hi @Dzahn, that's correct. Global root access on all machines. Sorry for the late response. Missed the notification. [14:09:25] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [14:09:50] Oy, scap backport is /so/ slow. [14:10:25] (03PS1) 10Joal: Update analytics data purge for webrequest_actor [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) [14:10:27] It used to take ~40 seconds to scap out a trivial change to IS. 
[14:11:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T328817)', diff saved to https://phabricator.wikimedia.org/P43818 and previous config saved to /var/cache/conftool/dbconfig/20230208-141106-marostegui.json [14:11:11] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2312 is CRITICAL: etcd last index (2395755) is outdated compared to the master one (2396837) https://wikitech.wikimedia.org/wiki/Etcd [14:11:11] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1428 is CRITICAL: etcd last index (1582970) is outdated compared to the master one (1583511) https://wikitech.wikimedia.org/wiki/Etcd [14:11:13] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2377 is CRITICAL: etcd last index (2395755) is outdated compared to the master one (2396837) https://wikitech.wikimedia.org/wiki/Etcd [14:11:13] PROBLEM - MediaWiki EtcdConfig up-to-date on parse1013 is CRITICAL: etcd last index (1582970) is outdated compared to the master one (1583511) https://wikitech.wikimedia.org/wiki/Etcd [14:11:27] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:882705|Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part I]] (duration: 08m 12s) [14:11:43] (03CR) 10Btullis: [C: 03+1] "This looks excellent! Thanks Nicolas." 
[alerts] - 10https://gerrit.wikimedia.org/r/887780 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [14:11:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882706 (owner: 10Jforrester) [14:12:37] (03Merged) 10jenkins-bot: Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882706 (owner: 10Jforrester) [14:12:59] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2312 is OK: etcd last index (2396837) matches the master one (2396837) https://wikitech.wikimedia.org/wiki/Etcd [14:12:59] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1428 is OK: etcd last index (1583511) matches the master one (1583511) https://wikitech.wikimedia.org/wiki/Etcd [14:12:59] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2377 is OK: etcd last index (2396837) matches the master one (2396837) https://wikitech.wikimedia.org/wiki/Etcd [14:12:59] RECOVERY - MediaWiki EtcdConfig up-to-date on parse1013 is OK: etcd last index (1583511) matches the master one (1583511) https://wikitech.wikimedia.org/wiki/Etcd [14:12:59] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:882706|Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part II]] [14:13:20] Hi mforns - IIRC you planned yesterday on doing a deploy today - is that correct? 
[14:13:48] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [14:14:27] whoops - wrong chan [14:14:50] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:882706|Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part II]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:15:45] (03PS4) 10Jforrester: Move non-variant wgMFNearby to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833770 [14:16:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [14:16:33] (03CR) 10Ayounsi: [C: 03+2] Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [14:16:51] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) [14:17:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:20:47] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:882706|Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part II]] (duration: 07m 48s) [14:21:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833770 (owner: 10Jforrester) [14:22:27] (03Merged) 10jenkins-bot: Move non-variant wgMFNearby to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833770 (owner: 10Jforrester) [14:22:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to 
https://phabricator.wikimedia.org/P43819 and previous config saved to /var/cache/conftool/dbconfig/20230208-142229-marostegui.json [14:22:37] (03PS3) 10Jforrester: Move non-variant wgMFUseWikibase to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833771 [14:22:53] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:833770|Move non-variant wgMFNearby to CommonSettings]] [14:24:39] Superpes: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/887751/ is currently empty - did you need to make more changes? [14:24:42] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:833770|Move non-variant wgMFNearby to CommonSettings]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:25:25] Superpes: Also https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/887752/ needs to be minified. [14:26:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P43820 and previous config saved to /var/cache/conftool/dbconfig/20230208-142613-marostegui.json [14:26:50] (03PS2) 10Superpes15: Replace trwiki temporary logo with the correct one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887751 (https://phabricator.wikimedia.org/T329047) [14:26:50] courtesy link for how to minify SVG files: https://www.mediawiki.org/wiki/Manual:Assets#SVG_files [14:28:49] Superpes: And the itwikt image is too large? It needs to be < 120 px wide, I think? [14:29:23] Thanks urbanecm. [14:29:37] !log test install of testvm6001 T327867 [14:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:40] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [14:29:55] @Urbanecm Thanks! 
James_F Uhm should it be 119x17 [14:30:30] Superpes: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/887752/2/static/images/mobile/copyright/wiktionary-wordmark-it.svg has `width="361.25" height="51.6875"`. [14:30:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10jbond) >>! In T238823#8596949, @cmooney wrote: > > I've seen this half-duplex close quite often down through the years. Some firewalls do it when they see a FIN... [14:31:41] James_F Oh in the svg! Thought you were talking about logos.php [14:31:46] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:833770|Move non-variant wgMFNearby to CommonSettings]] (duration: 08m 52s) [14:31:58] Superpes: Yeah, sorry for not being specific. [14:32:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833771 (owner: 10Jforrester) [14:33:02] (03Merged) 10jenkins-bot: Move non-variant wgMFUseWikibase to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833771 (owner: 10Jforrester) [14:33:13] James_F Yep I actually did it in a hurry without checking! Will fix it soon [14:33:28] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:833771|Move non-variant wgMFUseWikibase to CommonSettings]] [14:33:45] Superpes: Thanks! trwiki looks good to go. 
[14:34:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) [14:34:51] (03PS3) 10Jforrester: Replace trwiki temporary logo with the correct one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887751 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [14:35:10] (03PS1) 10Andrew Bogott: wmcs-novastats-cephleaks.py: add 'delete' functionality [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) [14:35:21] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:833771|Move non-variant wgMFUseWikibase to CommonSettings]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:36:11] (03CR) 10Andrew Bogott: "This whole script needs severe scrutiny because what it does is very serious and dangerous!" 
[puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [14:36:51] (03CR) 10Andrew Bogott: wmcs-novastats-cephleaks.py: add 'delete' functionality (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [14:37:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T328817)', diff saved to https://phabricator.wikimedia.org/P43821 and previous config saved to /var/cache/conftool/dbconfig/20230208-143735-marostegui.json [14:37:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [14:37:39] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:37:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [14:37:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T328817)', diff saved to https://phabricator.wikimedia.org/P43822 and previous config saved to /var/cache/conftool/dbconfig/20230208-143756-marostegui.json [14:39:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:41:06] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:833771|Move non-variant wgMFUseWikibase to CommonSettings]] (duration: 07m 37s) [14:41:09] (03PS4) 10Jforrester: Replace trwiki temporary legacy logo with one including the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887751 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [14:41:11] 
Finally. [14:41:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887751 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [14:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P43823 and previous config saved to /var/cache/conftool/dbconfig/20230208-144119-marostegui.json [14:41:23] (03PS3) 10Superpes15: Add a wordmark to itwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887752 (https://phabricator.wikimedia.org/T329168) [14:41:27] Superpes: Deploying your trwiki change now. [14:41:55] Oh will check James_F Also itwikt should be ready :) [14:42:00] (03Merged) 10jenkins-bot: Replace trwiki temporary legacy logo with one including the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887751 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [14:42:07] Superpes: Yeah, looks good! 
[14:42:25] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:887751|Replace trwiki temporary legacy logo with one including the wordmark (T329047)]] [14:42:28] T329047: Temporary logo change on the Turkish Wikipedia - https://phabricator.wikimedia.org/T329047 [14:44:12] (03CR) 10Nicolas Fraison: [C: 03+2] fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887780 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [14:44:16] !log jforrester@deploy1002 jforrester and superpes: Backport for [[gerrit:887751|Replace trwiki temporary legacy logo with one including the wordmark (T329047)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T328817)', diff saved to https://phabricator.wikimedia.org/P43824 and previous config saved to /var/cache/conftool/dbconfig/20230208-144437-marostegui.json [14:44:40] I'll need to purge the logos from Varnish too, of course. [14:44:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:45:22] (03Merged) 10jenkins-bot: fix(varnishkafka): Use rate instead of irate and increase period of VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887780 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [14:45:36] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10MoritzMuehlenhoff) >>! In T321775#8596987, @Vgutierrez wrote: > 2.6.6 has been running as expected since the experiment started, next week we plan to upgrade the whole CDN We should upgrade to 2.6.8, thou... [14:47:47] Oy. 
[14:47:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) Removed 2 gpu from an-worker1097 ,an-worker1096 And reinstalled in dse-k8s-worker1001 [14:47:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) 05Open→03Resolved [14:49:08] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10Vgutierrez) That already happened along the 2.4.21 upgrade [14:49:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) 05Resolved→03Open Accidentally closed task [14:49:51] (03CR) 10Andrew Bogott: wmcs-novastats-cephleaks.py: add 'delete' functionality (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [14:50:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new test VM in drmrs - jmm@cumin2002" [14:50:17] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:887751|Replace trwiki temporary legacy logo with one including the wordmark (T329047)]] (duration: 07m 52s) [14:50:21] T329047: Temporary logo change on the Turkish Wikipedia - https://phabricator.wikimedia.org/T329047 [14:50:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887752 (https://phabricator.wikimedia.org/T329168) (owner: 10Superpes15) [14:51:09] (03Merged) 10jenkins-bot: Add a wordmark to itwiktionary [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/887752 (https://phabricator.wikimedia.org/T329168) (owner: 10Superpes15) [14:51:32] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:887752|Add a wordmark to itwiktionary (T329168)]] [14:51:38] T329168: Adding a wordmark to itwiktionary - https://phabricator.wikimedia.org/T329168 [14:52:28] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10Jclark-ctr) 05Open→03Resolved [14:53:20] !log jforrester@deploy1002 superpes and jforrester: Backport for [[gerrit:887752|Add a wordmark to itwiktionary (T329168)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:53:45] (03CR) 10David Caro: cloudgw: consolidate extra floating IP routes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887372 (https://phabricator.wikimedia.org/T329041) (owner: 10Arturo Borrero Gonzalez) [14:53:59] This also looks good! [14:54:03] Excellent. [14:54:06] Deploying. [14:54:48] James_F Thanks :) [14:55:02] Superpes: Thank you! 
[14:56:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T328817)', diff saved to https://phabricator.wikimedia.org/P43825 and previous config saved to /var/cache/conftool/dbconfig/20230208-145625-marostegui.json [14:56:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [14:56:29] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:56:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [14:56:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:56:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:56:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T328817)', diff saved to https://phabricator.wikimedia.org/P43826 and previous config saved to /var/cache/conftool/dbconfig/20230208-145651-marostegui.json [14:57:41] (03PS1) 10Muehlenhoff: Assign installserver role to install1004 [puppet] - 10https://gerrit.wikimedia.org/r/887790 (https://phabricator.wikimedia.org/T327867) [14:57:50] (03PS2) 10Jforrester: labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 (owner: 10Kosta Harlan) [14:57:53] (03PS2) 10Jbond: redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 [14:58:03] (03CR) 10Jforrester: [C: 03+2] "I'll land and pull this to prod now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 (owner: 10Kosta Harlan) [14:58:29] (03Abandoned) 10Herron: admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron) [14:58:44] (03Merged) 10jenkins-bot: labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 (owner: 10Kosta Harlan) [14:59:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T328817)', diff saved to https://phabricator.wikimedia.org/P43828 and previous config saved to /var/cache/conftool/dbconfig/20230208-145901-marostegui.json [14:59:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P43829 and previous config saved to /var/cache/conftool/dbconfig/20230208-145944-marostegui.json [14:59:46] (03CR) 10Muehlenhoff: [C: 03+2] Assign installserver role to install1004 [puppet] - 10https://gerrit.wikimedia.org/r/887790 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [14:59:58] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:887752|Add a wordmark to itwiktionary (T329168)]] (duration: 08m 26s) [15:00:01] T329168: Adding a wordmark to itwiktionary - https://phabricator.wikimedia.org/T329168 [15:00:11] Okie-dokie, all done. 
[15:01:17] (03CR) 10CI reject: [V: 04-1] redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 (owner: 10Jbond) [15:02:05] James_F Many thanks for your support and sorry for having messed with the svg :) [15:02:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2420.codfw.wmnet with OS buster [15:02:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster [15:02:22] Superpes: It happens. :-) [15:05:14] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10MoritzMuehlenhoff) >>! In T321775#8597953, @Vgutierrez wrote: > That already happened along the 2.4.21 upgrade Yes, that's my point. We fixed CVE-2023-0056 with the upgrade to 2.4.21, so moving to 2.6.6 w... [15:09:59] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:10:01] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) @MoritzMuehlenhoff thanks for the tip. It did work however i am getting another error . Looks like we need to update the installer. Please let me know when it is done thamk... 
[15:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:14:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P43830 and previous config saved to /var/cache/conftool/dbconfig/20230208-151407-marostegui.json [15:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P43831 and previous config saved to /var/cache/conftool/dbconfig/20230208-151450-marostegui.json [15:15:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:39] (03PS1) 10Gmodena: Add a kafka consumer group to flink-app instance. [deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 [15:21:21] (03CR) 10Gmodena: Add a kafka consumer group to flink-app instance. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 (owner: 10Gmodena) [15:21:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm) [15:22:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [15:22:42] (03PS3) 10Jbond: redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 [15:25:25] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10Vgutierrez) yeah.. I meant that along upgrading the 2.4 hosts to 2.4.21 I also updated the 2.6 ones to 2.6.8 :) [15:25:32] (03PS2) 10Muehlenhoff: Move webproxy in eqiad to install1004 [dns] - 10https://gerrit.wikimedia.org/r/886889 (https://phabricator.wikimedia.org/T327867) [15:26:52] (03PS2) 10JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) [15:27:31] (03CR) 10Muehlenhoff: [C: 03+2] Move webproxy in eqiad to install1004 [dns] - 10https://gerrit.wikimedia.org/r/886889 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [15:29:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P43832 and previous config saved to /var/cache/conftool/dbconfig/20230208-152913-marostegui.json [15:29:18] (03CR) 10JHathaway: Add jaeger-es-index-cleaner (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [15:29:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T328817)', diff saved to https://phabricator.wikimedia.org/P43833 and previous config saved to 
/var/cache/conftool/dbconfig/20230208-152956-marostegui.json [15:29:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:30:00] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:30:08] (03PS2) 10Gmodena: Add a kafka consumer group to flink-app instance. [deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 (https://phabricator.wikimedia.org/T329061) [15:30:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:30:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:30:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:30:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T328817)', diff saved to https://phabricator.wikimedia.org/P43834 and previous config saved to /var/cache/conftool/dbconfig/20230208-153022-marostegui.json [15:32:19] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 (owner: 10Jbond) [15:32:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) Thanks @Jclark-ctr, that's excellent. I can confirm that both cards are detected correctly. ` btulli... 
[15:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T328817)', diff saved to https://phabricator.wikimedia.org/P43835 and previous config saved to /var/cache/conftool/dbconfig/20230208-153248-marostegui.json [15:34:11] (03PS4) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 [15:36:55] (03CR) 10jenkins-bot: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [15:38:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new test VM in drmrs - jmm@cumin2002" [15:40:09] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [15:40:15] (03PS1) 10Clément Goubert: Revert "P:monitoring: Absent hardcoded statsd host entry" [puppet] - 10https://gerrit.wikimedia.org/r/887762 [15:41:30] (03CR) 10Clément Goubert: [C: 03+2] Revert "P:monitoring: Absent hardcoded statsd host entry" [puppet] - 10https://gerrit.wikimedia.org/r/887762 (owner: 10Clément Goubert) [15:41:38] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Revert "P:monitoring: Absent hardcoded statsd host entry" [puppet] - 10https://gerrit.wikimedia.org/r/887762 (owner: 10Clément Goubert) [15:42:23] (03CR) 10Muehlenhoff: [C: 03+2] Point DHCP server in eqiad to install1004 [homer/public] - 10https://gerrit.wikimedia.org/r/886888 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [15:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T328817)', diff saved to https://phabricator.wikimedia.org/P43836 and previous config saved to /var/cache/conftool/dbconfig/20230208-154420-marostegui.json [15:44:24] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:47:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: consolidate extra floating IP 
routes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887372 (https://phabricator.wikimedia.org/T329041) (owner: 10Arturo Borrero Gonzalez) [15:47:25] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10cmooney) p:05Triage→03Low [15:47:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P43837 and previous config saved to /var/cache/conftool/dbconfig/20230208-154754-marostegui.json [15:47:58] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10cmooney) [15:49:03] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "PCC does not look right https://puppet-compiler.wmflabs.org/output/887373/39465/cloudgw1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [15:50:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2420.codfw.wmnet with OS buster [15:50:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster executed with errors: - mw2420 (**FAIL**) - Remove... 
[15:52:10] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) [15:54:37] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10cmooney) [15:54:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10cmooney) [15:57:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:03] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/887373/39467/" [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [15:59:24] (03PS1) 10Muehlenhoff: Update next-server DHCP settings towards install1004 [puppet] - 10https://gerrit.wikimedia.org/r/887796 (https://phabricator.wikimedia.org/T327867) [16:00:20] 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) [16:00:45] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1096.eqiad.wmnet - https://phabricator.wikimedia.org/T329147 (10wiki_willy) a:03Jclark-ctr [16:01:15] 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) p:05Triage→03High [16:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:03:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P43838 and previous config saved to /var/cache/conftool/dbconfig/20230208-160301-marostegui.json [16:03:59] (03PS1) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 [16:04:39] (03PS2) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 [16:04:43] (03CR) 10Muehlenhoff: [C: 03+2] Update next-server DHCP settings towards install1004 [puppet] - 10https://gerrit.wikimedia.org/r/887796 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [16:10:25] (03PS1) 10Muehlenhoff: Update cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/887798 (https://phabricator.wikimedia.org/T327867) [16:12:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:14:02] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on rpki1001.eqiad.wmnet with reason: Restarting to increase VM RAM allocation [16:14:17] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on rpki1001.eqiad.wmnet with reason: Restarting to increase VM RAM allocation [16:14:22] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b43e2a20-f4d1-41c3-84c0-7923683997b4) set by cmooney@cumin1001 for 0:20:00 on 1 host(s) and their services with reas... 
[16:15:53] (03PS22) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [16:17:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10MoritzMuehlenhoff) >>! In T326362#8598021, @Papaul wrote: > @MoritzMuehlenhoff thanks for the tip. It did work however i am getting another error . Looks like we need to update the... [16:17:28] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:18:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T328817)', diff saved to https://phabricator.wikimedia.org/P43839 and previous config saved to /var/cache/conftool/dbconfig/20230208-161807-marostegui.json [16:18:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:18:11] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:18:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:18:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43840 and previous config saved to /var/cache/conftool/dbconfig/20230208-161828-marostegui.json [16:18:49] (03PS23) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [16:19:09] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 
(https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:24:11] (03CR) 10Dzahn: "a bit surprised by this reply. haven't we been adding headers all the time? thought this would be helpful" [puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn) [16:24:16] (03Abandoned) 10Dzahn: add SPDX license headers to various roles I was involved in writing [puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn) [16:24:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) [16:24:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43841 and previous config saved to /var/cache/conftool/dbconfig/20230208-162447-marostegui.json [16:24:51] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:25:46] (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:27:08] !log rolling restart of haproxy and trafficserver in A:cp [16:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:06] (03CR) 10Muehlenhoff: add SPDX license headers to various roles I was involved in writing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn) [16:33:23] (03CR) 10Dzahn: "gotcha, thanks for the additional explanation" [puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn) [16:33:48] (03PS1) 10Cwhite: logstash: ulogd remove copy network.transport to network.protocol [puppet] - 10https://gerrit.wikimedia.org/r/886857 (https://phabricator.wikimedia.org/T329195) [16:33:50] 
(03PS1) 10Milimetric: Increase time window to reduce false positives [alerts] - 10https://gerrit.wikimedia.org/r/887803 [16:33:58] (03CR) 10CI reject: [V: 04-1] Increase time window to reduce false positives [alerts] - 10https://gerrit.wikimedia.org/r/887803 (owner: 10Milimetric) [16:35:00] (03CR) 10Milimetric: "Just a thought on the somewhat noisy varnishkafka alerts" [alerts] - 10https://gerrit.wikimedia.org/r/887803 (owner: 10Milimetric) [16:35:46] (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:03] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/887804 [16:37:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Jclark-ctr) [16:38:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) Ah yeah, I remember those now. Those that are including the "role::mediawiki::common" I gave up on that... 
[16:38:48] (03PS3) 10Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 [16:38:50] (03PS3) 10Giuseppe Lavagetto: sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 [16:38:52] (03PS1) 10Giuseppe Lavagetto: sre.discovery.datacenter: fix rollback logic [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) [16:39:47] (03PS1) 10Btullis: Remove the GPU configuration from an-worker109[67] [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) [16:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P43842 and previous config saved to /var/cache/conftool/dbconfig/20230208-163954-marostegui.json [16:41:07] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [16:41:21] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis) [16:43:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:44:05] !log [done] rolling restart of haproxy and trafficserver in A:cp [16:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:50] (03CR) 10Dzahn: [C: 03+1] vrts: enable/disable daemon depending on active host [puppet] - 10https://gerrit.wikimedia.org/r/886914 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:45:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:45:58] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:47:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:47:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:50:06] (03CR) 10Bking: [C: 03+1] elasticsearch/relforge: add contint2002 to cirrus::ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/867714 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:51:17] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [16:52:18] (03CR) 10Ottomata: [C: 03+1] "Let's merge this (with other patches?) when we are ready to deploy." [deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 (https://phabricator.wikimedia.org/T329061) (owner: 10Gmodena) [16:53:09] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [16:54:05] akosiaris, mutante: seems down - upstream connect error or disconnect/reset before headers. 
reset reason: connection failure [16:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P43844 and previous config saved to /var/cache/conftool/dbconfig/20230208-165500-marostegui.json [16:55:46] (JobUnavailable) firing: Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:57:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolserver_legacy: Add image credits [puppet] - 10https://gerrit.wikimedia.org/r/887781 (https://phabricator.wikimedia.org/T103965) (owner: 10Majavah) [16:57:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [16:58:25] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1003 is OK: HTTP OK: HTTP/1.1 200 OK - 6448 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [17:00:27] (03CR) 10JMeybohm: "The two namespaces are still a bit confusing. I would suggest to drop every "namespace: spark-operator" line as that's the namespace the c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [17:00:46] (JobUnavailable) resolved: Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:00:51] works again. I did nothing. 
in meeting now [17:00:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:01:00] thanks for the report though [17:02:51] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) [17:03:30] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10cmooney) I upgraded rpki1001 to 4GB RAM. Things looking stable now, service hasn't crashed. Used mem has settled down to about ~1.8GB. I'll take a look at rpki2002 shortly. {F36... [17:05:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:06:38] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [17:07:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [17:08:33] !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter status all services in codfw: maintenance [17:08:56] !log oblivian@cumin2002 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) status all services in codfw: maintenance [17:08:58] PROBLEM - Checks that the airflow database for airflow search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env 
AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:09:06] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) [17:09:21] !log disable bacula job backup1002.eqiad.wmnet-Weekly-Thu-EsRwCodfw-mysql-srv-backups-dumps-latest [17:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:40] (03CR) 10Gmodena: Add a kafka consumer group to flink-app instance. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 (https://phabricator.wikimedia.org/T329061) (owner: 10Gmodena) [17:10:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43847 and previous config saved to /var/cache/conftool/dbconfig/20230208-171006-marostegui.json [17:10:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:10:10] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:10:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:10:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T328817)', diff saved to https://phabricator.wikimedia.org/P43848 and previous config saved to /var/cache/conftool/dbconfig/20230208-171028-marostegui.json [17:10:36] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:11:02] !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter status all services in codfw: maintenance [17:11:10] !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in codfw: maintenance [17:13:50] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on rpki2002.codfw.wmnet with reason: Restarting to increase VM RAM allocation [17:14:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on rpki2002.codfw.wmnet with reason: Restarting to increase VM RAM allocation [17:14:09] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=31176d14-7d44-4799-8369-4293e8a58f51) set by cmooney@cumin1001 for 0:15:00 on 1 host(s) and their services with reas... [17:15:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:15:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:16:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T328817)', diff saved to https://phabricator.wikimedia.org/P43849 and previous config saved to /var/cache/conftool/dbconfig/20230208-171634-marostegui.json [17:16:38] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:20:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:20:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:20:22] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db1105:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43850 and previous config saved to /var/cache/conftool/dbconfig/20230208-172021-marostegui.json [17:20:46] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:22:25] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for rpki2002.codfw.wmnet,rpki1001.eqiad.wmnet [17:22:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for rpki2002.codfw.wmnet,rpki1001.eqiad.wmnet [17:24:51] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: eqiad1: move remaining VIP to /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887811 (https://phabricator.wikimedia.org/T295774) [17:25:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: eqiad1: move remaining VIP to /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887811 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [17:26:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43851 and previous config saved to /var/cache/conftool/dbconfig/20230208-172641-marostegui.json [17:30:46] (JobUnavailable) resolved: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:31:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P43852 and previous config saved to /var/cache/conftool/dbconfig/20230208-173140-marostegui.json [17:33:06] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [17:33:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [17:33:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43853 and previous config saved to /var/cache/conftool/dbconfig/20230208-173325-ladsgroup.json [17:41:39] (03CR) 10Herron: "sketching out a possible alternative to forcing statsd clients to ipv4 only https://puppet-compiler.wmflabs.org/output/887804/39469/graphi" [puppet] - 10https://gerrit.wikimedia.org/r/887804 (owner: 10Herron) [17:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P43854 and previous config saved to /var/cache/conftool/dbconfig/20230208-174148-marostegui.json [17:42:14] (03PS1) 10Andrea Denisse: centrallog: Add centrallog1001 to quickdatacopy allow hosts [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) [17:44:24] (03CR) 10Ottomata: [C: 03+1] Add a kafka consumer group to flink-app instance. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 (https://phabricator.wikimedia.org/T329061) (owner: 10Gmodena) [17:44:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10cmooney) [17:45:04] 10SRE, 10Infrastructure-Foundations, 10netops: BGPalerter crashing every 10 mins - https://phabricator.wikimedia.org/T329190 (10cmooney) 05Open→03Resolved a:03cmooney Change made on rpki2002 also and it seems happy. Closing task. 
[17:45:12] (03Abandoned) 10Arturo Borrero Gonzalez: eqiad1: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887288 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [17:45:21] 10SRE, 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10aborrero) a:03Volans hey @Volans I made the required changes on our side. Please verify if Netbox is happier now and close the ticket if so... [17:45:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10aborrero) [17:46:29] 10SRE: add Hal Triedman (htriedman) to ops-l mailing list - https://phabricator.wikimedia.org/T329209 (10Htriedman) [17:46:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P43855 and previous config saved to /var/cache/conftool/dbconfig/20230208-174647-marostegui.json [17:47:43] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [17:55:28] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) @Clement_Goubert @LSobanski @thcipriani I'd like to ping translators before the end of this week. Befo... 
[17:56:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P43856 and previous config saved to /var/cache/conftool/dbconfig/20230208-175654-marostegui.json [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1800) [18:00:26] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [18:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T328817)', diff saved to https://phabricator.wikimedia.org/P43857 and previous config saved to /var/cache/conftool/dbconfig/20230208-180153-marostegui.json [18:01:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [18:02:01] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:02:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [18:02:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43858 and previous config saved to /var/cache/conftool/dbconfig/20230208-180216-marostegui.json [18:02:17] (03CR) 10Herron: opensearch: reverse-proxy access to opensearch API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [18:06:01] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) @BCornwall would you mind emailing me with more information about what is needed here so I can better understand + contact the right Shopify representative? Thanks in advance!... 
[18:06:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:07:15] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:09:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43859 and previous config saved to /var/cache/conftool/dbconfig/20230208-180933-marostegui.json [18:09:36] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:12:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43860 and previous config saved to /var/cache/conftool/dbconfig/20230208-181200-marostegui.json [18:12:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [18:12:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [18:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T328817)', diff saved to https://phabricator.wikimedia.org/P43861 and previous config saved to /var/cache/conftool/dbconfig/20230208-181222-marostegui.json [18:15:02] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) See this previous comment from Brandon at T128559#3440144 [18:18:15] 10SRE: followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10CDanis) 
[18:18:29] (03PS30) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [18:18:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T328817)', diff saved to https://phabricator.wikimedia.org/P43862 and previous config saved to /var/cache/conftool/dbconfig/20230208-181829-marostegui.json [18:18:33] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:22:10] 10SRE, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10CDanis) [18:24:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P43863 and previous config saved to /var/cache/conftool/dbconfig/20230208-182439-marostegui.json [18:25:30] 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [18:26:23] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) [18:26:27] 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [18:26:38] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [18:27:26] (03Abandoned) 10BCornwall: idp: Set cloud TLS/SSL compatiblility to strong [puppet] - 10https://gerrit.wikimedia.org/r/885844 (https://phabricator.wikimedia.org/T238518) (owner: 10BCornwall) [18:27:36] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Disable TLSv1/TLSv1.1 on sites without caching layer - 
https://phabricator.wikimedia.org/T238518 (10BCornwall) 05In progress→03Resolved I've been advised to move the LDAP work into a separate ticket (T329218) since the traffic team doesn't have enough hands-on... [18:30:18] 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10bking) Thanks @jbond ! Looking at the Spicerack changelog, I also see that [[ https://gerrit.... [18:30:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) @Muehlenhoff Here was my attempt to fix the "mediawiki::common" ones: https://gerrit.wikimedia.org/r/c... [18:31:15] 10SRE, 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [18:31:19] (03Abandoned) 10Milimetric: Increase time window to reduce false positives [alerts] - 10https://gerrit.wikimedia.org/r/887803 (owner: 10Milimetric) [18:32:28] 10SRE, 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [18:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P43864 and previous config saved to /var/cache/conftool/dbconfig/20230208-183336-marostegui.json [18:34:13] (03PS2) 10Andrea Denisse: centrallog: Add centrallog1001 to quickdatacopy allow hosts [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) [18:35:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43865 and previous config saved to /var/cache/conftool/dbconfig/20230208-183556-ladsgroup.json [18:35:59] T328255: 
Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [18:36:17] 10SRE, 10Traffic-Icebox, 10Sustainability (Incident Followup): cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (10BCornwall) 05Open→03Invalid This ticket is in a weird place: ats-tls is no longer in use, but the fact that *both* ats-tls and va... [18:36:24] (03CR) 10Dzahn: [C: 03+2] add az.wikimedia.org for Azerbaijani Wikimedians User Group [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn) [18:36:38] (03CR) 10Dzahn: [C: 03+2] "no other responses ever, gotta be bold :)" [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn) [18:36:57] (03PS3) 10Dzahn: add az.wikimedia.org for Azerbaijani Wikimedians User Group [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) [18:37:14] 10SRE, 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [18:39:21] !log adding az.wikimedia.org to DNS - approved by affcom T306015 [18:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:24] T306015: Create a wiki for Azerbaijani Wikimedians User Group - https://phabricator.wikimedia.org/T306015 [18:39:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P43866 and previous config saved to /var/cache/conftool/dbconfig/20230208-183945-marostegui.json [18:40:29] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) 05Stalled→03In progress [18:40:35] 10SRE, 10Traffic, 10HTTPS, 10Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (10BCornwall) [18:40:47] 
10SRE, 10LDAP: LDAP connections use TLSv1.0 and TLSv1.1 - https://phabricator.wikimedia.org/T329218 (10BCornwall) [18:42:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2420.codfw.wmnet with OS buster [18:42:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster [18:48:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P43867 and previous config saved to /var/cache/conftool/dbconfig/20230208-184842-marostegui.json [18:48:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) @MoritzMuehlenhoff thank you. [18:49:34] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) @SHust, as @Dzahn pointed out, it would be best if we keep this all in one place. We specifically need the `preload` and `includeSubDomains` attributes added to the `Stric... 
[18:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P43868 and previous config saved to /var/cache/conftool/dbconfig/20230208-185102-ladsgroup.json [18:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43869 and previous config saved to /var/cache/conftool/dbconfig/20230208-185451-marostegui.json [18:54:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:54:55] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:55:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:55:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T328817)', diff saved to https://phabricator.wikimedia.org/P43870 and previous config saved to /var/cache/conftool/dbconfig/20230208-185513-marostegui.json [18:57:12] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2420.codfw.wmnet with OS buster [18:57:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster executed with errors: - mw2420 (**FAIL**) - Remove... [18:59:23] !log milimetric@deploy1002 Started deploy [analytics/refinery@9101b03]: Regular analytics weekly train [analytics/refinery@9101b03] [18:59:34] (03PS1) 10Cwhite: profile: disable grafana db sync ahead of 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/886860 (https://phabricator.wikimedia.org/T317887) [18:59:37] (03CR) 10Dzahn: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/867714 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [19:00:05] ^demon and dancy: #bothumor I ♥ Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1900). [19:00:05] ^demon and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T1900). [19:01:29] o/ [19:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T328817)', diff saved to https://phabricator.wikimedia.org/P43871 and previous config saved to /var/cache/conftool/dbconfig/20230208-190226-marostegui.json [19:02:28] (03CR) 10Dzahn: [C: 03+2] "checked on relforge1003, no issue, ferm gets refreshed" [puppet] - 10https://gerrit.wikimedia.org/r/867714 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [19:02:30] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:03:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T328817)', diff saved to https://phabricator.wikimedia.org/P43872 and previous config saved to /var/cache/conftool/dbconfig/20230208-190349-marostegui.json [19:03:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [19:04:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [19:04:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T328817)', diff saved to https://phabricator.wikimedia.org/P43873 and previous config saved to /var/cache/conftool/dbconfig/20230208-190410-marostegui.json [19:05:51] !log milimetric@deploy1002 Finished deploy [analytics/refinery@9101b03]: Regular analytics weekly 
train [analytics/refinery@9101b03] (duration: 06m 28s) [19:05:54] !log milimetric@deploy1002 Started deploy [analytics/refinery@9101b03] (thin): Regular analytics weekly train THIN [analytics/refinery@9101b03] [19:06:01] !log milimetric@deploy1002 Finished deploy [analytics/refinery@9101b03] (thin): Regular analytics weekly train THIN [analytics/refinery@9101b03] (duration: 00m 07s) [19:06:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P43874 and previous config saved to /var/cache/conftool/dbconfig/20230208-190608-ladsgroup.json [19:06:42] !log milimetric@deploy1002 Started deploy [analytics/refinery@9101b03] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9101b03] [19:07:47] (03CR) 10Dzahn: [C: 03+2] phorge: install php-mbstring, php-curl and php-mysql modules [puppet] - 10https://gerrit.wikimedia.org/r/887433 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [19:08:02] !log milimetric@deploy1002 Finished deploy [analytics/refinery@9101b03] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9101b03] (duration: 01m 20s) [19:08:03] (03PS2) 10Dzahn: phorge: move httpd setup to profile and don't call it apache [puppet] - 10https://gerrit.wikimedia.org/r/887432 (https://phabricator.wikimedia.org/T328595) [19:09:22] (03CR) 10Dzahn: [C: 03+2] phorge: move httpd setup to profile and don't call it apache [puppet] - 10https://gerrit.wikimedia.org/r/887432 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [19:10:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T328817)', diff saved to https://phabricator.wikimedia.org/P43875 and previous config saved to /var/cache/conftool/dbconfig/20230208-191023-marostegui.json [19:10:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:12:35] !log pt1979@cumin2002 START - 
Cookbook sre.hosts.reimage for host mw2420.codfw.wmnet with OS buster [19:12:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster [19:17:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P43876 and previous config saved to /var/cache/conftool/dbconfig/20230208-191732-marostegui.json [19:18:08] (03CR) 10Ladsgroup: [C: 03+1] multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [19:18:13] (03PS3) 10Krinkle: multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932) [19:18:15] (03PS3) 10Krinkle: logos: Exclude logos/index.html from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885065 [19:18:17] (03PS3) 10Krinkle: multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) [19:18:19] (03CR) 10Ladsgroup: [C: 03+1] logos: Exclude logos/index.html from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885065 (owner: 10Krinkle) [19:18:21] (03CR) 10Ladsgroup: [C: 03+1] multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [19:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43877 and previous config saved to /var/cache/conftool/dbconfig/20230208-192115-ladsgroup.json 
[19:21:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:21:19] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [19:21:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43878 and previous config saved to /var/cache/conftool/dbconfig/20230208-192136-ladsgroup.json [19:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P43879 and previous config saved to /var/cache/conftool/dbconfig/20230208-192530-marostegui.json [19:32:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P43880 and previous config saved to /var/cache/conftool/dbconfig/20230208-193239-marostegui.json [19:34:31] (03PS2) 10Dzahn: phorge: install php-mbstring, php-curl and php-mysql modules [puppet] - 10https://gerrit.wikimedia.org/r/887433 (https://phabricator.wikimedia.org/T328595) [19:34:33] (03PS1) 10Dzahn: phorge: renamed apache.conf.erb to httpd.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/887820 (https://phabricator.wikimedia.org/T328595) [19:34:56] (03CR) 10Dzahn: phorge: install php-mbstring, php-curl and php-mysql modules [puppet] - 10https://gerrit.wikimedia.org/r/887433 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [19:35:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) @MoritzMuehlenhoff this did turn out to be a raid controller/disk issue and not a Debian installer issue. Sorry for the noise. 
[19:35:24] (03PS2) 10Dzahn: phorge: rename apache.conf.erb to httpd.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/887820 (https://phabricator.wikimedia.org/T328595) [19:39:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2420.codfw.wmnet with reason: host reimage [19:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P43881 and previous config saved to /var/cache/conftool/dbconfig/20230208-194036-marostegui.json [19:41:31] (03CR) 10Dzahn: [C: 03+2] phorge: rename apache.conf.erb to httpd.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/887820 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [19:42:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2420.codfw.wmnet with reason: host reimage [19:43:17] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [19:44:26] (03CR) 10Urbanecm: [C: 03+1] Add Apache configuration for azwikimedia [puppet] - 10https://gerrit.wikimedia.org/r/887434 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [19:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T328817)', diff saved to https://phabricator.wikimedia.org/P43882 and previous config saved to /var/cache/conftool/dbconfig/20230208-194745-marostegui.json [19:47:49] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:50:52] (03CR) 10Dzahn: Revert "contint: remove obsolete firewall rules from labs" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [19:52:39] !log dancy@deploy1002 say aborted: (duration: 00m 01s) [19:53:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: 
Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10RobH) [19:53:29] (03CR) 10Dzahn: Revert "contint: remove obsolete firewall rules from labs" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [19:55:02] (03PS1) 10Urbanecm: [logos] Make logos/manage.py work again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887822 [19:55:06] (03CR) 10Dzahn: "I wonder how we got to "Fast forward to 2020, we no more use Puppet to provide the test environment." and now back to managing beta with p" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [19:55:16] (03CR) 10Dzahn: [C: 03+2] Revert "contint: remove obsolete firewall rules from labs" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [19:55:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T328817)', diff saved to https://phabricator.wikimedia.org/P43883 and previous config saved to /var/cache/conftool/dbconfig/20230208-195542-marostegui.json [19:55:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:55:46] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:55:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:56:10] (03CR) 10Dzahn: [C: 03+2] "not in love with it but merging it anyways because it's going back to a previous status quo and I don't want to block you. 
I would still l" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [19:58:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:59:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10RobH) [19:59:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [20:00:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [20:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43884 and previous config saved to /var/cache/conftool/dbconfig/20230208-200006-marostegui.json [20:01:01] (03CR) 10Dzahn: [C: 03+2] "would you also be willing to review my contint-related changes again now?" [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [20:02:04] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) I'll do as suggested and keep all communications here, thanks @Dzahn. @BCornwall I emailed the Shopify rep that takes care of our account and will update this thread as soon a... 
[20:02:40] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887824 (https://phabricator.wikimedia.org/T325585) [20:02:42] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887824 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [20:03:23] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887824 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [20:04:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:04:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2420.codfw.wmnet with OS buster [20:04:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster completed: - mw2420 (**PASS**) - Removed from Pupp... 
[20:06:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43885 and previous config saved to /var/cache/conftool/dbconfig/20230208-200614-marostegui.json [20:06:18] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [20:11:11] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.22 refs T325585 [20:11:14] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585 [20:11:56] (03PS1) 10Urbanecm: [logos] Regenerate logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887825 [20:11:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10RobH) [20:15:01] (03PS2) 10Urbanecm: [logos] Regenerate logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887825 [20:17:14] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) >>! In T128559#3440144, @BBlack wrote: > It seems like Shopify has been making some improvements on this front since we last checked. > .. > The help page doesn't indicate whe... 
[20:17:44] !log demon@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.22 refs T325585 (duration: 06m 33s) [20:17:47] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585 [20:18:14] (03PS1) 10Urbanecm: guwwikiquote: Add custom logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887826 (https://phabricator.wikimedia.org/T321247) [20:21:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P43886 and previous config saved to /var/cache/conftool/dbconfig/20230208-202120-marostegui.json [20:21:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10RobH) [20:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43887 and previous config saved to /var/cache/conftool/dbconfig/20230208-202249-ladsgroup.json [20:22:52] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:27:26] (03PS1) 10Cwhite: Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) [20:31:18] (03CR) 10RLazarus: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39474/console" [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: 10RLazarus) [20:33:01] (03PS2) 10Cwhite: Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) [20:34:28] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39475/console" [puppet] - 10https://gerrit.wikimedia.org/r/887434 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [20:35:05] (03CR) 10Herron: [C: 
03+2] "discussed during weekly o11y standup -- this is good to go, we will follow up separately to the "majority up" points" [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron) [20:36:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P43888 and previous config saved to /var/cache/conftool/dbconfig/20230208-203627-marostegui.json [20:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43889 and previous config saved to /var/cache/conftool/dbconfig/20230208-203755-ladsgroup.json [20:41:42] (03PS2) 10RLazarus: Add Apache configuration for azwikimedia [puppet] - 10https://gerrit.wikimedia.org/r/887434 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [20:42:33] (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/887434 (https://phabricator.wikimedia.org/T306015) (owner: 10Zabe) [20:42:52] (03PS1) 10Urbanecm: [tox] Use py39 jobs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [20:43:36] (03PS2) 10Urbanecm: [tox] Use py39 jobs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [20:43:44] !log disabling puppet on C:profile::mediawiki::webserver to merge and test 887434 - T306015 [20:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:47] T306015: Create a wiki for Azerbaijani Wikimedians User Group - https://phabricator.wikimedia.org/T306015 [20:45:41] (03CR) 10Cwhite: [C: 03+2] profile: disable grafana db sync ahead of 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/886860 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [20:48:42] !log enabled puppet on C:profile::mediawiki::webserver - T306015 [20:48:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43890 and previous config saved to /var/cache/conftool/dbconfig/20230208-205133-marostegui.json [20:51:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [20:51:37] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [20:51:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [20:51:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:52:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:52:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T328817)', diff saved to https://phabricator.wikimedia.org/P43891 and previous config saved to /var/cache/conftool/dbconfig/20230208-205211-marostegui.json [20:53:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43892 and previous config saved to /var/cache/conftool/dbconfig/20230208-205301-ladsgroup.json [20:54:22] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) The additional information did help clear things up, thanks @Dzahn. I'm also glad to see that the test results have improved. I have added a substantial amount of information a... 
[20:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T328817)', diff saved to https://phabricator.wikimedia.org/P43893 and previous config saved to /var/cache/conftool/dbconfig/20230208-205803-marostegui.json [20:58:07] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230208T2100). [21:00:04] Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:15] let's do it :) [21:00:22] all yours :D [21:00:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887822 (owner: 10Urbanecm) [21:00:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887825 (owner: 10Urbanecm) [21:00:53] (and just for the avoidance of doubt, I'm hands-off from that apache config change, fire away!) [21:01:01] ty rzl! [21:01:29] (03Merged) 10jenkins-bot: [logos] Make logos/manage.py work again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887822 (owner: 10Urbanecm) [21:01:32] (03Merged) 10jenkins-bot: [logos] Regenerate logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887825 (owner: 10Urbanecm) [21:03:04] ... [21:03:12] Gerrit could not merge the change '887825' as is and could require a rebase [21:03:18] both changes got merged? 
[21:03:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887822|[logos] Make logos/manage.py work again]], [[gerrit:887825|[logos] Regenerate logos.php]] [21:05:01] I guess you already rebased 887825 on top of 887822 yourself before [21:06:21] i kind of expected scap to notice that [21:07:41] (03PS2) 10Urbanecm: guwwikiquote: Add custom logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887826 (https://phabricator.wikimedia.org/T321247) [21:08:03] (03CR) 10Urbanecm: [C: 03+2] guwwikiquote: Add custom logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887826 (https://phabricator.wikimedia.org/T321247) (owner: 10Urbanecm) [21:08:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43894 and previous config saved to /var/cache/conftool/dbconfig/20230208-210807-ladsgroup.json [21:08:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:08:12] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:08:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:08:45] (03Merged) 10jenkins-bot: guwwikiquote: Add custom logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887826 (https://phabricator.wikimedia.org/T321247) (owner: 10Urbanecm) [21:10:48] (03CR) 10Cwhite: [V: 03+1] "Builds OK" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [21:11:15] (03CR) 10Hashar: [C: 03+1] Revert "contint: remove obsolete firewall rules from labs" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [21:11:24] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887822|[logos] Make logos/manage.py 
work again]], [[gerrit:887825|[logos] Regenerate logos.php]] (duration: 07m 42s) [21:11:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887826|guwwikiquote: Add custom logo (T321247)]] [21:11:52] T321247: Create Wikiquote Gungbe - https://phabricator.wikimedia.org/T321247 [21:13:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P43895 and previous config saved to /var/cache/conftool/dbconfig/20230208-211309-marostegui.json [21:13:37] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:887826|guwwikiquote: Add custom logo (T321247)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:18:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2421.codfw.wmnet with OS buster [21:18:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2421.codfw.wmnet with OS buster [21:20:00] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887826|guwwikiquote: Add custom logo (T321247)]] (duration: 08m 10s) [21:20:03] T321247: Create Wikiquote Gungbe - https://phabricator.wikimedia.org/T321247 [21:25:07] (03CR) 10Hashar: [C: 04-1] "Sorry I was waiting for the change to be amended following my review while Daniel was expecting response from my side. 
In short we had a d" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [21:25:36] (03PS3) 10Hashar: contint: factor common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) [21:26:55] (03CR) 10Hashar: [C: 04-1] "Worth a note, today I had to unbreak some ferm rules for the Jenkins agent and went to refactor part of the code which moves the list of h" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [21:27:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2422.codfw.wmnet with OS buster [21:27:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2422.codfw.wmnet with OS buster [21:28:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P43896 and previous config saved to /var/cache/conftool/dbconfig/20230208-212815-marostegui.json [21:28:24] (03CR) 10Hashar: "Note this conflicts with https://gerrit.wikimedia.org/r/c/operations/puppet/+/850593 which proposes to move the list of Jenkins controller" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [21:31:59] (03CR) 10Herron: "shall we move forward with this?" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [21:32:28] (03PS3) 10Urbanecm: [tox] Use py39 jobs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [21:35:20] (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [21:37:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2421.codfw.wmnet with reason: host reimage [21:40:05] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mw2422 [21:41:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2421.codfw.wmnet with reason: host reimage [21:41:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw2422 [21:42:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [21:43:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T328817)', diff saved to https://phabricator.wikimedia.org/P43897 and previous config saved to /var/cache/conftool/dbconfig/20230208-214322-marostegui.json [21:43:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:43:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:43:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:43:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43898 and previous config saved to 
/var/cache/conftool/dbconfig/20230208-214343-marostegui.json [21:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43899 and previous config saved to /var/cache/conftool/dbconfig/20230208-214849-marostegui.json [21:48:53] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:56:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:57:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2422.codfw.wmnet with reason: host reimage [21:58:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:58:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2421.codfw.wmnet with OS buster [21:58:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2421.codfw.wmnet with OS buster completed: - mw2421 (**PASS**) - Removed from Pupp... 
[22:01:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2422.codfw.wmnet with reason: host reimage [22:03:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P43900 and previous config saved to /var/cache/conftool/dbconfig/20230208-220356-marostegui.json [22:05:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [22:05:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [22:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43901 and previous config saved to /var/cache/conftool/dbconfig/20230208-220532-ladsgroup.json [22:05:35] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:07:17] (03PS4) 10Urbanecm: [tox] Make running `tox` work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [22:08:04] (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [22:12:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2423.codfw.wmnet with OS buster [22:12:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2423.codfw.wmnet with OS buster [22:14:32] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic-Icebox, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) [22:14:36] (03PS5) 10Urbanecm: [tox] Make running `tox` work 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) [22:14:41] (03CR) 10Urbanecm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: 10Urbanecm) [22:15:15] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887840 (https://phabricator.wikimedia.org/T325585) [22:15:17] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887840 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [22:15:54] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887840 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [22:16:54] (03PS1) 10RobH: Setting H745 to disallowed [software] - 10https://gerrit.wikimedia.org/r/887841 (https://phabricator.wikimedia.org/T329226) [22:17:00] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:17:26] (03CR) 10RobH: [C: 03+2] Setting H745 to disallowed [software] - 10https://gerrit.wikimedia.org/r/887841 (https://phabricator.wikimedia.org/T329226) (owner: 10RobH) [22:17:36] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) [22:17:58] (03Merged) 10jenkins-bot: Setting H745 to disallowed [software] - 10https://gerrit.wikimedia.org/r/887841 (https://phabricator.wikimedia.org/T329226) (owner: 10RobH) [22:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P43902 and previous config saved to /var/cache/conftool/dbconfig/20230208-221902-marostegui.json [22:20:58] legoktm: hi! 
if you're around, I'd appreciate some quick feedback re https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/887830/. the goal is to enable tox as a voting job...soon, so logos.php is validated by CI. thanks! [22:21:55] I might be [22:23:14] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.21 refs T325585 [22:23:18] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585 [22:24:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:25:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2422.codfw.wmnet with OS buster [22:25:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2422.codfw.wmnet with OS buster completed: - mw2422 (**PASS**) - Removed from Pupp... [22:28:39] (03CR) 10Cwhite: [C: 03+1] "At first glance, this seems to shoehorn in ipv6 support. Tried to approximate this locally and was successful." [puppet] - 10https://gerrit.wikimedia.org/r/887804 (owner: 10Herron) [22:29:10] (03PS1) 10EoghanGaffney: Adds 'before' directive to docker::network in gitlab runner setup [puppet] - 10https://gerrit.wikimedia.org/r/887843 (https://phabricator.wikimedia.org/T329035) [22:29:33] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [22:29:44] !log demon@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.21 refs T325585 (duration: 06m 29s) [22:29:47] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585 [22:32:00] (03CR) 10Cwhite: [C: 03+1] "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[22:32:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2423.codfw.wmnet with reason: host reimage
[22:33:18] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39476/console" [puppet] - https://gerrit.wikimedia.org/r/887843 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney)
[22:34:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T328817)', diff saved to https://phabricator.wikimedia.org/P43903 and previous config saved to /var/cache/conftool/dbconfig/20230208-223408-marostegui.json
[22:34:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[22:34:12] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[22:34:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[22:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T328817)', diff saved to https://phabricator.wikimedia.org/P43904 and previous config saved to /var/cache/conftool/dbconfig/20230208-223430-marostegui.json
[22:35:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2423.codfw.wmnet with reason: host reimage
[22:40:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T328817)', diff saved to https://phabricator.wikimedia.org/P43905 and previous config saved to /var/cache/conftool/dbconfig/20230208-224028-marostegui.json
[22:40:32] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[22:43:12] (CR) Superpes15: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/887765 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[22:51:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:55:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P43906 and previous config saved to /var/cache/conftool/dbconfig/20230208-225534-marostegui.json
[22:58:03] (CR) Urbanecm: [C: +1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/887765 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[23:05:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43907 and previous config saved to /var/cache/conftool/dbconfig/20230208-230550-ladsgroup.json
[23:05:54] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:10:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P43908 and previous config saved to /var/cache/conftool/dbconfig/20230208-231041-marostegui.json
[23:14:44] SRE, Diff-blog, Technical Blog, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Dzahn) >>! In T270034#6701977, @RLazarus wrote: > Thanks @Varnent for offering to look at this Any plans to still do that?
[23:17:22] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/887765 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[23:18:01] (Merged) jenkins-bot: Change the trwiki logo with a temporary one (vector 2022) [mediawiki-config] - https://gerrit.wikimedia.org/r/887765 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[23:18:26] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887765|Change the trwiki logo with a temporary one (vector 2022) (T329047)]]
[23:18:29] T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047
[23:20:15] !log urbanecm@deploy1002 superpes and urbanecm: Backport for [[gerrit:887765|Change the trwiki logo with a temporary one (vector 2022) (T329047)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[23:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43909 and previous config saved to /var/cache/conftool/dbconfig/20230208-232056-ladsgroup.json
[23:24:01] (CR) Dzahn: "thank you for merging this so quickly, wow" [puppet] - https://gerrit.wikimedia.org/r/887434 (https://phabricator.wikimedia.org/T306015) (owner: Zabe)
[23:25:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T328817)', diff saved to https://phabricator.wikimedia.org/P43910 and previous config saved to /var/cache/conftool/dbconfig/20230208-232547-marostegui.json
[23:25:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[23:25:51] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[23:26:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[23:26:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T328817)', diff saved to https://phabricator.wikimedia.org/P43911 and previous config saved to /var/cache/conftool/dbconfig/20230208-232608-marostegui.json
[23:26:58] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887765|Change the trwiki logo with a temporary one (vector 2022) (T329047)]] (duration: 08m 32s)
[23:27:01] T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047
[23:28:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T328817)', diff saved to https://phabricator.wikimedia.org/P43912 and previous config saved to /var/cache/conftool/dbconfig/20230208-232821-marostegui.json
[23:33:49] (CR) Krinkle: [C: +1] Enable profile::auto_restarts::service for Arclamp [puppet] - https://gerrit.wikimedia.org/r/887769 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[23:36:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43913 and previous config saved to /var/cache/conftool/dbconfig/20230208-233603-ladsgroup.json
[23:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P43915 and previous config saved to /var/cache/conftool/dbconfig/20230208-234327-marostegui.json
[23:51:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43916 and previous config saved to /var/cache/conftool/dbconfig/20230208-235109-ladsgroup.json
[23:51:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:51:13] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:51:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:51:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[23:51:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[23:52:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43917 and previous config saved to /var/cache/conftool/dbconfig/20230208-235157-ladsgroup.json
[23:58:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P43918 and previous config saved to /var/cache/conftool/dbconfig/20230208-235833-marostegui.json