[00:04:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T60674)', diff saved to https://phabricator.wikimedia.org/P29289 and previous config saved to /var/cache/conftool/dbconfig/20220601-000448-ladsgroup.json
[00:04:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[00:04:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[00:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:55] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674
[00:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:30] (CR) Krinkle: [C: -1] Add language fallback support for wmgSiteLogoVariants (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[00:09:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:18] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (Papaul) this is complete ` Disk /dev/sda: 446.63 GiB, 479559942144 bytes, 936640512 sectors Disk /dev/sdb: 446.63 GiB, 479559942144 bytes, 936640512 secto...
[00:12:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29290 and previous config saved to /var/cache/conftool/dbconfig/20220601-001234-ladsgroup.json [00:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:46] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:18:26] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [00:21:24] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:50] (03CR) 10Krinkle: profiler: Turn from functions into class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:22:54] (03PS2) 10Krinkle: profiler: Turn from functions into class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932) [00:24:38] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [00:24:58] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-05-24 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:24:58] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-05-24 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:25:53] (03PS1) 10Krinkle: Profiler: Update 
wmfSetupProfiler() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801831 (https://phabricator.wikimedia.org/T308932) [00:25:55] (03PS1) 10Krinkle: Profiler: Remove temporary back-compat for wmfSetupProfiler() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801832 (https://phabricator.wikimedia.org/T308932) [00:26:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [00:26:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [00:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T60674)', diff saved to https://phabricator.wikimedia.org/P29291 and previous config saved to /var/cache/conftool/dbconfig/20220601-002701-ladsgroup.json [00:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:12] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [00:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T309311)', diff saved to https://phabricator.wikimedia.org/P29292 and previous config saved to /var/cache/conftool/dbconfig/20220601-002739-ladsgroup.json [00:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:45] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [00:54:58] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-05-31 00:00:01 (3098 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:55:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T60674)', diff 
saved to https://phabricator.wikimedia.org/P29293 and previous config saved to /var/cache/conftool/dbconfig/20220601-005537-ladsgroup.json [00:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:47] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [01:00:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:01:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:03:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:03:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P29294 and previous config saved to /var/cache/conftool/dbconfig/20220601-011043-ladsgroup.json [01:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:22] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:18] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P29295 and previous config saved to /var/cache/conftool/dbconfig/20220601-012548-ladsgroup.json 
[01:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T60674)', diff saved to https://phabricator.wikimedia.org/P29296 and previous config saved to /var/cache/conftool/dbconfig/20220601-014053-ladsgroup.json [01:40:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:40:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:00] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [01:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:58] (03PS2) 10Tim Starling: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) [01:59:00] (03PS2) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [01:59:02] (03PS1) 10Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [02:03:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [02:03:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [02:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T60674)', diff 
saved to https://phabricator.wikimedia.org/P29297 and previous config saved to /var/cache/conftool/dbconfig/20220601-020339-ladsgroup.json [02:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:50] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [02:15:56] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:16:00] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:18:42] (03PS3) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [02:18:44] (03PS2) 10Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [02:23:47] (03CR) 10Tim Starling: "Per collab meeting comments:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [02:28:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T60674)', diff saved to https://phabricator.wikimedia.org/P29298 and previous config saved to /var/cache/conftool/dbconfig/20220601-022851-ladsgroup.json [02:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:59] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [02:43:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P29299 and previous config saved to 
/var/cache/conftool/dbconfig/20220601-024356-ladsgroup.json [02:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:46] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:38] (03PS5) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [02:45:47] (03CR) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [02:51:30] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:27] (03PS6) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [02:59:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P29300 and previous config saved to /var/cache/conftool/dbconfig/20220601-025901-ladsgroup.json [02:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T60674)', diff saved to https://phabricator.wikimedia.org/P29301 and previous config saved to /var/cache/conftool/dbconfig/20220601-031406-ladsgroup.json [03:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:14] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [03:26:06] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:28:10] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:43:46] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:44:54] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 96 probes of 671 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:13:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 68 probes of 671 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:41:58] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:44:50] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:45:00] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:49:08] PROBLEM - IPv6 ping to eqiad on 
ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 99 probes of 671 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:51:02] (03PS1) 10Marostegui: x2: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/801841 (https://phabricator.wikimedia.org/T306118) [04:51:42] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:52:06] (03CR) 10Marostegui: [C: 03+2] x2: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/801841 (https://phabricator.wikimedia.org/T306118) (owner: 10Marostegui) [04:53:06] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:54:07] (03CR) 10Tchanders: [C: 03+1] MetaContactPages: Update reference to `ext.wikimediamessages.contactpage` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801423 (owner: 10Krinkle) [04:55:11] 10SRE, 10Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (10Marostegui) Thank you! 
[04:55:26] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 63 probes of 671 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:00:32] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (netboxdb2002), Fresh: 113 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:01:44] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:02:47] 10SRE, 10ops-codfw, 10DBA: db2088 crashed - https://phabricator.wikimedia.org/T309485 (10Marostegui) Thanks Papaul. I can indeed access the host now. MySQL seems to be fine. I am going to repool this host once it catches up and close this. If it happens again, we can probably decommission it as it is sche... [05:02:50] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:04:46] <_joe_> that doesn't look good. [05:09:27] (03CR) 10Tim Starling: "I cherry-picked this to deployment-prep and edited the hieradata in horizon to make the new script be used. 
The results were not what I ex" [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [05:14:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:42] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:20:50] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:22] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:36] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:48] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:01:24] PROBLEM - Host an-worker1094 is DOWN: PING CRITICAL - Packet loss = 100% [06:01:38] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 114 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:13:15] (03CR) 10Aaron Schulz: [C: 03+1] Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [06:14:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:14:30] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:44] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:21:04] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:24] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Provide a python3-bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/800213 (owner: 10Majavah) [06:52:10] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] "image published." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/800213 (owner: 10Majavah) [06:58:05] (03PS1) 10David Caro: wmcs: Added task, ircmail and page routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [07:00:04] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:03:28] oof. 
[07:04:00] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:04:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:05:52] (PS2) David Caro: wmcs: Added task, ircmail and page routings [puppet] - https://gerrit.wikimedia.org/r/802040
[07:08:07] (CR) David Caro: wmcs: Added task, ircmail and page routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[07:14:14] (CR) Majavah: wmcs: Added task, ircmail and page routings (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[07:18:47] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (thiemowmde)
[07:21:54] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:26:36] (CR) Awight: profiler: Turn from functions into class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[07:30:14] SRE, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Stuartyeates) I think I've just been struck by this bug. https://en.wikipedia.org/w/index.php?title=User_talk%3AStuartyeates&type...
[07:34:40] !log installing libxml2 security updates
[07:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:13] (PS3) David Caro: wmcs: Added task, ircmail and page routings [puppet] - https://gerrit.wikimedia.org/r/802040
[07:42:15] (CR) David Caro: wmcs: Added task, ircmail and page routings (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[07:44:57] (PS1) David Caro: wmcs: add a few alerts specific to wmcs [alerts] - https://gerrit.wikimedia.org/r/802045
[07:44:58] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:29] (CR) CI reject: [V: -1] wmcs: add a few alerts specific to wmcs [alerts] - https://gerrit.wikimedia.org/r/802045 (owner: David Caro)
[07:49:44] (PS2) David Caro: wmcs: add a few alerts specific to wmcs [alerts] - https://gerrit.wikimedia.org/r/802045
[07:51:50] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:16] SRE, Search-Console-access-request: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (SCherukuwada) Access has now been revoked.
[07:53:20] (CR) Slyngshede: [C: +2] aptrepo::repo move validation command to Python3 [puppet] - https://gerrit.wikimedia.org/r/801705 (owner: Slyngshede)
[07:54:04] (CR) David Caro: wmcs: Added task, ircmail and page routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[08:00:08] !log installing idp2002 T308214
[08:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:15] T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214
[08:02:06] (PS1) Ayounsi: Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802046 (https://phabricator.wikimedia.org/T262446)
[08:04:26] (PS1) Muehlenhoff: Add idp1002/2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/802048
[08:05:02] (PS2) Samtar: ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431)
[08:05:08] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:05:21] (PS2) Samtar: crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431)
[08:06:15] (CR) Muehlenhoff: [C: +2] Add idp1002/2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/802048 (owner: Muehlenhoff)
[08:06:54] SRE, Search-Console-access-request: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (RhinosF1) Should this be closed then?
[08:10:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 for migration to 10.6 T309679', diff saved to https://phabricator.wikimedia.org/P29305 and previous config saved to /var/cache/conftool/dbconfig/20220601-081044-root.json [08:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:52] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [08:11:14] (03PS1) 10Marostegui: db1137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/802049 (https://phabricator.wikimedia.org/T309679) [08:11:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add some weight to x1 master', diff saved to https://phabricator.wikimedia.org/P29306 and previous config saved to /var/cache/conftool/dbconfig/20220601-081130-marostegui.json [08:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:11:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:30] (03CR) 10Marostegui: [C: 03+2] db1137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/802049 (https://phabricator.wikimedia.org/T309679) (owner: 10Marostegui) [08:13:45] (03PS2) 10Muehlenhoff: chartmuseum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801635 (https://phabricator.wikimedia.org/T308013) [08:14:40] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:34] (03CR) 10Muehlenhoff: [C: 03+2] chartmuseum: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/801635 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:15:53] (03PS1) 10Jbond: CONTRIBUTORS: add Thiemo Kreuz [puppet] - 10https://gerrit.wikimedia.org/r/802050 (https://phabricator.wikimedia.org/T308013) [08:16:25] (03CR) 10Jbond: [V: 03+2 C: 03+2] CONTRIBUTORS: add Thiemo Kreuz [puppet] - 10https://gerrit.wikimedia.org/r/802050 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [08:17:22] (03PS2) 10Zabe: graphite: remove absented update_graphite_index cron [puppet] - 10https://gerrit.wikimedia.org/r/779023 (https://phabricator.wikimedia.org/T273673) [08:19:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1043.eqiad.wmnet with OS bullseye [08:19:50] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1043.eqiad.wmnet with OS bullseye [08:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:18] !log installing openssl security updates [08:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:32] (03PS1) 10Marostegui: db1137: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/802051 (https://phabricator.wikimedia.org/T309679) [08:21:05] (03PS2) 10Zabe: snmp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800248 (https://phabricator.wikimedia.org/T308013) [08:21:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:39] (03CR) 10Zabe: snmp: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800248 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:21:55] (03CR) 10Marostegui: [C: 03+2] db1137: 
Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/802051 (https://phabricator.wikimedia.org/T309679) (owner: 10Marostegui) [08:23:45] (03CR) 10Filippo Giunchedi: [C: 03+2] ldap-corp: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/801723 (https://phabricator.wikimedia.org/T244792) (owner: 10Filippo Giunchedi) [08:23:51] (03PS2) 10Filippo Giunchedi: ldap-corp: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/801723 (https://phabricator.wikimedia.org/T244792) [08:24:03] (03PS2) 10Zabe: r_lang: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800254 (https://phabricator.wikimedia.org/T308013) [08:24:14] (03CR) 10Filippo Giunchedi: [V: 03+2] ldap-corp: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/801723 (https://phabricator.wikimedia.org/T244792) (owner: 10Filippo Giunchedi) [08:24:48] (03CR) 10Zabe: r_lang: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800254 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:26:30] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: remove absented update_graphite_index cron [puppet] - 10https://gerrit.wikimedia.org/r/779023 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:26:34] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] maintenance::wikidata: Update cron with lb and lb-pool params [puppet] - 10https://gerrit.wikimedia.org/r/797077 (https://phabricator.wikimedia.org/T238751) (owner: 10Giuseppe Lavagetto) [08:28:24] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [08:28:43] (03CR) 10Volans: "@Arzhel, I wish you had told me before starting to work on this. 
I have a local branch where I had started a huge refactor of this script " [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802046 (https://phabricator.wikimedia.org/T262446) (owner: 10Ayounsi) [08:30:22] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1043.eqiad.wmnet with OS bullseye [08:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:28] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1043.eqiad.wmnet with OS bullseye executed with errors: - ms-be1043 (**FAIL**)... [08:30:33] (03CR) 10Filippo Giunchedi: "Thank you for tackling this! I'm +1 on the idea in general, however as it stands I believe both alerts (sre and wmcs) will fire since sre'" [alerts] - 10https://gerrit.wikimedia.org/r/802045 (owner: 10David Caro) [08:30:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1043.eqiad.wmnet with OS bullseye [08:31:03] (03CR) 10Filippo Giunchedi: [C: 03+2] blackbox: add IRC probe module [puppet] - 10https://gerrit.wikimedia.org/r/801714 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1043.eqiad.wmnet with OS bullseye [08:34:13] (03PS1) 10Muehlenhoff: acmechief: Remove old buster staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/802052 [08:34:54] (03PS1) 10Giuseppe Lavagetto: Revert "maintenance::wikidata: Update cron with lb and lb-pool params" [puppet] - 10https://gerrit.wikimedia.org/r/801762 [08:35:25] 
(03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/800248 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:35:49] (03PS2) 10Giuseppe Lavagetto: Revert "maintenance::wikidata: Update cron with lb and lb-pool params" [puppet] - 10https://gerrit.wikimedia.org/r/801762 (https://phabricator.wikimedia.org/T238751) [08:36:24] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:01] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Joe) Sadly I had to revert, because the `--lb` and the `--lb-pool` commands are not recognized by the script. ` mwm... [08:37:05] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "maintenance::wikidata: Update cron with lb and lb-pool params" [puppet] - 10https://gerrit.wikimedia.org/r/801762 (https://phabricator.wikimedia.org/T238751) (owner: 10Giuseppe Lavagetto) [08:37:43] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Addshore) *looks at the script closer* [08:39:04] !log powercycle an-worker1094 - OEM event registered in `racadm getsel`, host frozen [08:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:10] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Addshore) Right, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikidata.org/+/552544 needs to go in first! (... 
[08:39:12] RECOVERY - Host an-worker1094 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [08:40:29] (03CR) 10David Caro: wmcs: add a few alerts specific to wmcs (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/802045 (owner: 10David Caro) [08:41:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:41:35] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:19] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:42] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:12] (03PS1) 10Zabe: vm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802060 (https://phabricator.wikimedia.org/T308013) [08:44:14] (03PS1) 10Zabe: system: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802061 (https://phabricator.wikimedia.org/T308013) [08:44:18] (03PS1) 10Zabe: sysctl: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802062 (https://phabricator.wikimedia.org/T308013) [08:44:20] (03PS1) 10Zabe: strongswan: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802063 (https://phabricator.wikimedia.org/T308013) [08:44:22] (03PS1) 10Zabe: statsd_proxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802064 (https://phabricator.wikimedia.org/T308013) [08:44:24] (03PS1) 10Zabe: sonofgridengine: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802065 (https://phabricator.wikimedia.org/T308013) [08:44:26] (03PS1) 10Zabe: reposync: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802066 (https://phabricator.wikimedia.org/T308013) [08:45:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1043.eqiad.wmnet with reason: host reimage [08:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:53] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/800254 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:48:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1043.eqiad.wmnet with reason: host reimage [08:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:39] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:11] !log installing idp1002 T308214 [08:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:18] T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 [08:53:13] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:55:21] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:56:00] (03CR) 10Filippo Giunchedi: wmcs: add a few alerts specific to wmcs (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/802045 (owner: 10David Caro) [08:56:06] (03CR) 10Vgutierrez: [C: 03+1] acmechief: Remove old buster staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/802052 (owner: 10Muehlenhoff) [08:56:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1137 in x1 with minimal weight to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29307 and previous config saved to /var/cache/conftool/dbconfig/20220601-085620-marostegui.json [08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:29] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [08:56:42] moritzm: BTW, do you really need the specific hostnames behind
idp-test.wm.o on the SAN list? [08:59:35] maybe, maybe not. totally depends on the application integrated into the IDPs. and rather than introducing subtle breakage, better include them [09:00:18] 10SRE, 10Search-Console-access-request: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10SCherukuwada) 05Open→03Resolved [09:01:22] (03PS1) 10Filippo Giunchedi: hieradata: TCP probe for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/802071 (https://phabricator.wikimedia.org/T305847) [09:03:15] (03CR) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [09:03:24] (03PS1) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [09:04:18] (03PS2) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [09:05:02] (03CR) 10CI reject: [V: 04-1] Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [09:05:18] (03CR) 10Jbond: [C: 03+2] move more nrpe checks to nrpe::plugin and sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/801665 (owner: 10Majavah) [09:08:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1043.eqiad.wmnet with OS bullseye [09:08:38] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1043.eqiad.wmnet with OS bullseye completed: - ms-be1043 (**PASS**) - Removed... 
[09:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:39] (03PS3) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [09:10:07] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:30] (03CR) 10CI reject: [V: 04-1] Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [09:10:36] uff [09:11:59] (03PS1) 10Marostegui: Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801763 [09:12:30] (03CR) 10CI reject: [V: 04-1] Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801763 (owner: 10Marostegui) [09:12:57] (03Abandoned) 10Marostegui: Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801763 (owner: 10Marostegui) [09:13:35] (03PS1) 10Marostegui: db1137: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/802073 (https://phabricator.wikimedia.org/T309679) [09:14:51] (03PS4) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [09:15:10] this is going to get a -1 again, sorry for the spam [09:15:35] (03CR) 10CI reject: [V: 04-1] Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [09:16:01] (03CR) 10Jbond: [C: 03+2] vm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802060 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:05] (03CR) 10Jbond: [C: 03+2] system: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/802061 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:09] (03CR) 10Jbond: [C: 03+2] sysctl: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802062 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:13] (03CR) 10Jbond: [C: 03+2] strongswan: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802063 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:17] (03CR) 10Jbond: [C: 03+2] statsd_proxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802064 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:20] (03CR) 10Jbond: [C: 03+2] sonofgridengine: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802065 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:23] (03CR) 10Jbond: [C: 03+2] reposync: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802066 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:16:24] moritzm: ack [09:16:43] (03CR) 10Jbond: [C: 03+2] "Thanks for all these <3 merging them all" [puppet] - 10https://gerrit.wikimedia.org/r/802066 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:18:01] (03PS4) 10David Caro: wmcs: Added task, ircmail and page routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [09:18:03] (03PS1) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [09:18:06] (03PS1) 10Ladsgroup: Remove EnableLocalTimedText from SpecialOrphanedTimedText [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801764 (https://phabricator.wikimedia.org/T309677) [09:18:19] (03Abandoned) 10David Caro: wmcs: add a few alerts specific to wmcs [alerts] - 10https://gerrit.wikimedia.org/r/802045 (owner: 10David Caro) [09:18:21] jouncebot: nowandnext [09:18:21] No deployments scheduled for the next 3 hour(s) and 41 minute(s) [09:18:21] In 3 hour(s) and 41 minute(s): 
UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1300) [09:18:41] (03CR) 10Ladsgroup: [C: 03+2] Remove EnableLocalTimedText from SpecialOrphanedTimedText [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801764 (https://phabricator.wikimedia.org/T309677) (owner: 10Ladsgroup) [09:18:47] (03PS4) 10JMeybohm: Remove null creationTimestamp from CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165) [09:18:51] (03PS3) 10JMeybohm: Add crds.yaml fixtures to charts and istio schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/793509 (https://phabricator.wikimedia.org/T306165) [09:18:55] (03PS11) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [09:21:53] (03PS5) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [09:23:04] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35638/console" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [09:23:11] (03CR) 10Marostegui: [C: 03+2] db1137: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/802073 (https://phabricator.wikimedia.org/T309679) (owner: 10Marostegui) [09:23:46] (03CR) 10Filippo Giunchedi: "You'll need to pass alerting_relabel_configs_extra to prometheus::server from the various prometheus instances (e.g. modules/profile/manif" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [09:23:59] dcaro: hope that's clear enough ^ [09:29:25] godog: I saw something was off from the PCC :), so adding it to all the instances then? [09:29:46] (services/cloud/labs/tools/global/...) 
[09:30:04] I'm guessing that labs/cloud/tools/paws don't really need it xd [09:30:40] saying because an option is to add it as a default for the class [09:31:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35639/console" [puppet] - 10https://gerrit.wikimedia.org/r/802071 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:31:17] (03PS1) 10Jbond: check_netbox_report: add url to output [puppet] - 10https://gerrit.wikimedia.org/r/802075 [09:31:35] dcaro: yeah adding to the class would be fine too, it is harmless on instances that don't need it, that'd work for me too [09:31:49] (03PS1) 10Slyngshede: aptrepo::repo Allow seperate distributions files per reprepro conf. [puppet] - 10https://gerrit.wikimedia.org/r/802076 [09:32:10] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Sgs) Great. Thank you for helping with the process @Dzahn! [09:32:58] (03CR) 10CI reject: [V: 04-1] check_netbox_report: add url to output [puppet] - 10https://gerrit.wikimedia.org/r/802075 (owner: 10Jbond) [09:33:09] (03CR) 10Aklapper: "This is ready to go now" [puppet] - 10https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018) (owner: 10Aklapper) [09:33:40] (03CR) 10Muehlenhoff: [C: 03+2] acmechief: Remove old buster staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/802052 (owner: 10Muehlenhoff) [09:35:01] (03Merged) 10jenkins-bot: Remove EnableLocalTimedText from SpecialOrphanedTimedText [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801764 (https://phabricator.wikimedia.org/T309677) (owner: 10Ladsgroup) [09:35:22] (03CR) 10CI reject: [V: 04-1] aptrepo::repo Allow seperate distributions files per reprepro conf.
[puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [09:36:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1044.eqiad.wmnet with OS bullseye [09:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:13] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1044.eqiad.wmnet with OS bullseye [09:36:21] (03CR) 10Elukey: "Folks I tried to change the JSON schema with what I thought made sense, but no strong opinions. Lemme know your thoughts!" [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [09:39:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:39:49] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35640/console" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [09:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] (03PS2) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [09:42:08] (03PS5) 10David Caro: wmcs: Added task, ircmail and page routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [09:42:10] (03PS3) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [09:42:20] (03Abandoned) 10Muehlenhoff: Only apply automated restarts for imagecatalog on the active deployment server [puppet] - 10https://gerrit.wikimedia.org/r/775822 (https://phabricator.wikimedia.org/T305135) (owner: 10Muehlenhoff) [09:43:46] !log ladsgroup@deploy1002 Synchronized 
php-1.39.0-wmf.14/extensions/TimedMediaHandler/includes/SpecialOrphanedTimedText.php: Backport: [[gerrit:801764|Remove EnableLocalTimedText from SpecialOrphanedTimedText (T309677)]] (duration: 03m 09s) [09:43:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:43:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:44:41] (03CR) 10Vgutierrez: [C: 03+1] "sorry about the delay!" [puppet] - 10https://gerrit.wikimedia.org/r/791678 (owner: 10Dzahn) [09:45:00] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:02] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35642/console" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [09:45:41] (03CR) 10JMeybohm: [C: 03+2] Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [09:45:44] (03CR) 10JMeybohm: [C: 03+2] Add crds.yaml fixtures to charts and istio schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/793509 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [09:45:47] (03CR) 10JMeybohm: [C: 03+2] Remove null creationTimestamp from CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [09:47:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2027.codfw.wmnet [09:47:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:48:05] (03PS2) 10Muehlenhoff: matomo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801640 (https://phabricator.wikimedia.org/T308013) [09:50:56] (03Merged) 10jenkins-bot: Remove 
null creationTimestamp from CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [09:51:02] (03Merged) 10jenkins-bot: Add crds.yaml fixtures to charts and istio schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/793509 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [09:51:04] (03Merged) 10jenkins-bot: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [09:52:30] (03PS2) 10Lucas Werkmeister (WMDE): query_service: don’t cache index files [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) [09:52:32] (03PS1) 10Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS, WCQS) [puppet] - 10https://gerrit.wikimedia.org/r/802079 [09:52:55] (03PS2) 10Slyngshede: aptrepo::repo Allow seperate distributions files per reprepro conf. 
[puppet] - 10https://gerrit.wikimedia.org/r/802076 [09:53:06] (03CR) 10Lucas Werkmeister (WMDE): "Now with some httpbb tests following advice on IRC yesterday :)" [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [09:53:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1044.eqiad.wmnet with reason: host reimage [09:53:17] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:53:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:56:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1044.eqiad.wmnet with reason: host reimage [09:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:08] (03CR) 10CI reject: [V: 04-1] httpbb: Add basic tests for query_service (WDQS, WCQS) [puppet] - 10https://gerrit.wikimedia.org/r/802079 (owner: 10Lucas Werkmeister (WMDE)) [09:57:12] (03PS4) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [09:59:17] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [10:00:38] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35643/console" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [10:00:58] (03PS2) 10Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS, WCQS) [puppet] - 10https://gerrit.wikimedia.org/r/802079 [10:01:00] (03PS3) 10Lucas Werkmeister (WMDE): query_service: don’t cache index 
files [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) [10:01:31] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/801766 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [10:03:22] (03CR) 10David Caro: [V: 03+1] "Hi @Jbond, I seem to be unable to inject the default through hiera, I might be just doing something dumb, is there anything that pops out " [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [10:03:51] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:28] (03CR) 10Majavah: wmcs: Added task, ircmail and page routings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [10:05:44] (03CR) 10Jbond: [C: 03+1] aptrepo::repo Allow seperate distributions files per reprepro conf. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [10:06:22] (03CR) 10Vgutierrez: [C: 03+1] delete expired globalsign-2018/2019 certs. [puppet] - 10https://gerrit.wikimedia.org/r/791673 (owner: 10Dzahn) [10:08:09] (03CR) 10Muehlenhoff: "Looks good, one comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [10:08:51] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.0 - https://phabricator.wikimedia.org/T309116 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:09:04] (03CR) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [10:09:21] 10SRE, 10SRE-OnFire, 10Observability-Logging, 10Sustainability (Incident Followup), 10Wikimedia-Incident: create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:09:38] 10SRE, 10Wikimedia-Mailing-lists: [[mail:]] should redirect to the main page https://lists.wikimedia.org/postorius/lists/ - https://phabricator.wikimedia.org/T309558 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:09:53] (03CR) 10Majavah: wmcs: relabel alerts from wmcs cluster with wmcs team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [10:09:57] (03CR) 10David Caro: wmcs: Added task, ircmail and page routings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [10:11:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1044.eqiad.wmnet with OS bullseye [10:11:05] jbond: thanks for the feedback re: prometheus http checks, I'm in a bit of a rush but let's talk next week [10:11:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1044.eqiad.wmnet with OS bullseye completed: - ms-be1044 (**PASS**) - Downtim... 
[10:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:43] (03CR) 10Majavah: wmcs: Added task, ircmail and page routings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [10:12:31] (03CR) 10Filippo Giunchedi: wmcs: relabel alerts from wmcs cluster with wmcs team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [10:13:24] !log installing openldap security updates [10:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:43] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:08] ok I have to go, tty on Monday [10:19:18] (03PS1) 10Muehlenhoff: Remove twentyafterfour from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/802088 [10:23:40] (03CR) 10Muehlenhoff: [C: 03+2] Remove twentyafterfour from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/802088 (owner: 10Muehlenhoff) [10:27:26] (03PS1) 10Jelto: wikimedia.org: reduce TTL for gitlab A and AAAA to 5m [dns] - 10https://gerrit.wikimedia.org/r/802090 (https://phabricator.wikimedia.org/T307142) [10:28:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1045.eqiad.wmnet with OS bullseye [10:28:51] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1045.eqiad.wmnet with OS bullseye [10:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:58] (03PS1) 10Majavah: move more nrpe checks to nrpe::plugin and sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/802091 [10:41:11] (03PS2) 10Majavah: move more nrpe checks to nrpe::plugin and sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/802091 
[10:42:56] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35645/console" [puppet] - 10https://gerrit.wikimedia.org/r/802091 (owner: 10Majavah) [10:45:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1045.eqiad.wmnet with reason: host reimage [10:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:27] (03PS3) 10Slyngshede: aptrepo::repo Allow seperate distributions files per reprepro conf. [puppet] - 10https://gerrit.wikimedia.org/r/802076 [10:47:02] (03CR) 10Slyngshede: aptrepo::repo Allow seperate distributions files per reprepro conf. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [10:47:36] (03CR) 10CI reject: [V: 04-1] aptrepo::repo Allow seperate distributions files per reprepro conf. [puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [10:48:18] 10SRE-OnFire: Incident: 2022-05-09 confctl - https://phabricator.wikimedia.org/T309691 (10LSobanski) [10:48:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1045.eqiad.wmnet with reason: host reimage [10:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:45] !log upgrade fastnetmon to 1.2.1 in eqiad - T271228 [10:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:48] T271228: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 [10:49:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:50:09] (03PS4) 10Slyngshede: aptrepo::repo Allow seperate distributions files per reprepro conf. 
[puppet] - 10https://gerrit.wikimedia.org/r/802076 [10:50:37] (03PS1) 10Muehlenhoff: Record LDAP access for ozhang [puppet] - 10https://gerrit.wikimedia.org/r/802093 (https://phabricator.wikimedia.org/T309559) [10:51:05] (03CR) 10CI reject: [V: 04-1] aptrepo::repo Allow seperate distributions files per reprepro conf. [puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [10:51:37] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:46] !log upgrade fastnetmon to 1.2.1 in esams - T271228 [10:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] (03PS5) 10Slyngshede: aptrepo::repo Allow seperate distributions files per reprepro conf. [puppet] - 10https://gerrit.wikimedia.org/r/802076 [10:52:49] (03CR) 10Muehlenhoff: [C: 03+2] Record LDAP access for ozhang [puppet] - 10https://gerrit.wikimedia.org/r/802093 (https://phabricator.wikimedia.org/T309559) (owner: 10Muehlenhoff) [10:53:25] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:26] 10SRE-OnFire: Incident: 2022-05-09 confctl - https://phabricator.wikimedia.org/T309691 (10LSobanski) [10:53:47] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to nda for ozhang - https://phabricator.wikimedia.org/T309559 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Alright! I've enabled @OZhang-WMF's access, you should now be able to access https://superset.wikimedia.org. If it... 
[10:54:30] !log upgrade fastnetmon to 1.2.1 in eqsin - T271228 [10:54:32] (03PS1) 10Ladsgroup: Don't call saveOptions in LocalUserCreated [extensions/PageTriage] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802106 (https://phabricator.wikimedia.org/T306636) [10:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:46] (03PS1) 10Ladsgroup: Don't call saveOptions in LocalUserCreated [extensions/PageTriage] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/802107 (https://phabricator.wikimedia.org/T306636) [10:56:18] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (10ayounsi) 05Open→03Resolved a:03ayounsi All done! [10:57:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [10:59:53] (03CR) 10Slyngshede: [C: 03+2] aptrepo::repo Allow seperate distributions files per reprepro conf. [puppet] - 10https://gerrit.wikimedia.org/r/802076 (owner: 10Slyngshede) [11:00:07] (03CR) 10Ayounsi: [C: 03+1] Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [11:01:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10ayounsi) Indeed, that looks great! 
[11:01:47] jouncebot: nowandnext [11:01:47] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [11:01:47] In 1 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1300) [11:01:52] cool [11:01:59] (03CR) 10Ladsgroup: [C: 03+2] Don't call saveOptions in LocalUserCreated [extensions/PageTriage] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802106 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [11:02:05] (03CR) 10Ladsgroup: [C: 03+2] Don't call saveOptions in LocalUserCreated [extensions/PageTriage] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/802107 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [11:04:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:04:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:01] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:08:48] (03Merged) 10jenkins-bot: Don't call saveOptions in LocalUserCreated [extensions/PageTriage] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802106 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [11:10:31] (03Merged) 10jenkins-bot: Don't call saveOptions in LocalUserCreated [extensions/PageTriage] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/802107 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [11:13:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:13:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:14:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:41] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.14/extensions/PageTriage/includes/Hooks.php: Backport: [[gerrit:802106|Don't call saveOptions in LocalUserCreated (T306636)]] (duration: 03m 01s) [11:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:44] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636 [11:16:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1045.eqiad.wmnet with OS bullseye [11:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:46] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1045.eqiad.wmnet with OS bullseye completed: - ms-be1045 (**PASS**) - Downtim... 
[11:18:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1137 in x1 with minimal weight to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29312 and previous config saved to /var/cache/conftool/dbconfig/20220601-111805-marostegui.json [11:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:10] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [11:19:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:31] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:58] (03PS1) 10Jbond: P:netbox: Add http proxy support to reports [puppet] - 10https://gerrit.wikimedia.org/r/802095 (https://phabricator.wikimedia.org/T296452) [11:21:00] (03PS1) 10Jbond: C:netbox: add support for proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/802096 (https://phabricator.wikimedia.org/T296452) [11:21:02] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/PageTriage/includes/Hooks.php: Backport: [[gerrit:802107|Don't call saveOptions in LocalUserCreated (T306636)]] (duration: 03m 16s) [11:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:06] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting 
transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636 [11:21:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:23] (03CR) 10Ayounsi: [C: 03+1] "1 nit :)" [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [11:24:37] (03PS1) 10Kevin Bazira: ml-services: add ptwiki & ruwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/802097 (https://phabricator.wikimedia.org/T307418) [11:25:52] (03PS2) 10Jbond: C:netbox: add support for proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/802096 (https://phabricator.wikimedia.org/T296452) [11:26:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35646/console" [puppet] - 10https://gerrit.wikimedia.org/r/802096 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:28:36] (03PS1) 10Muehlenhoff: profile::mariadb::ferm_misc: Remove old buster idp-test hosts, add new idp/bullseye ones [puppet] - 10https://gerrit.wikimedia.org/r/802098 (https://phabricator.wikimedia.org/T308214) [11:30:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:30:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T309617)', diff saved to https://phabricator.wikimedia.org/P29313 and previous config saved to 
/var/cache/conftool/dbconfig/20220601-113017-ladsgroup.json [11:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:22] T309617: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T309617 [11:31:29] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:netbox: add support for proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/802096 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:31:42] (03CR) 10CI reject: [V: 04-1] profile::mariadb::ferm_misc: Remove old buster idp-test hosts, add new idp/bullseye ones [puppet] - 10https://gerrit.wikimedia.org/r/802098 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [11:33:08] (03PS1) 10Majavah: nagios_common: remove -labs files [puppet] - 10https://gerrit.wikimedia.org/r/802099 [11:34:05] (03PS1) 10Jbond: hieradata: fix alias [puppet] - 10https://gerrit.wikimedia.org/r/802100 [11:34:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] hieradata: fix alias [puppet] - 10https://gerrit.wikimedia.org/r/802100 (owner: 10Jbond) [11:36:22] PROBLEM - Check systemd state on netbox2002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:27] (03PS1) 10Jbond: netbox: quote strings [puppet] - 10https://gerrit.wikimedia.org/r/802102 [11:36:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] netbox: quote strings [puppet] - 10https://gerrit.wikimedia.org/r/802102 (owner: 10Jbond) [11:38:18] (ProbeDown) firing: Service netbox:443 has failed probes (http_netbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:19] (ProbeDown) firing: Service netbox:443 has failed probes (http_netbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:38] * jbond looking [11:39:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T309617)', diff saved to https://phabricator.wikimedia.org/P29314 and previous config saved to /var/cache/conftool/dbconfig/20220601-113911-ladsgroup.json [11:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:16] T309617: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T309617 [11:39:40] (03PS2) 10Muehlenhoff: Remove old buster idp-test hosts, add new idp/bullseye ones from Ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/802098 (https://phabricator.wikimedia.org/T308214) [11:40:31] jbond: it should be a false alarm given it's a service::catalog-related alert and netbox production is not yet on that new infrastructure, correct? [11:40:48] volans: yes i think that's correct [11:41:48] ok I'm acking the alert [11:42:16] how to figure out what exactly is being checked? 
[11:42:52] I can't find the alert in alertmanager [11:43:18] (ProbeDown) resolved: Service netbox:443 has failed probes (http_netbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:19] (ProbeDown) resolved: Service netbox:443 has failed probes (http_netbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:45] XioNoX: i think the alert cleared straight away [11:43:56] oh ok [11:43:56] i pushed out a bad config to netbox1002 which i fixed pretty quickly [11:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1137 in x1 with minimal weight to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29315 and previous config saved to /var/cache/conftool/dbconfig/20220601-114418-marostegui.json [11:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:24] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [11:45:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:20] * jbond looking [11:47:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:40] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P29317 and previous config saved to /var/cache/conftool/dbconfig/20220601-115416-ladsgroup.json [11:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:58] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:56:12] (03CR) 10David Caro: wmcs: Added task, ircmail and page routings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [11:58:01] (03PS6) 10David Caro: wmcs: Added task, ircmail and page routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [11:58:11] (03CR) 10David Caro: wmcs: Added task, ircmail and page routings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [11:58:51] (03PS1) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [12:01:57] (03PS5) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [12:01:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35649/console" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:02:27] (03PS6) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [12:03:25] (03PS2) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [12:04:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35651/console" [puppet] - 
10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:04:38] (03PS1) 10Jbond: C:netbox: add http proxy config to uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/802105 (https://phabricator.wikimedia.org/T296452) [12:04:46] (03PS3) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [12:05:33] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35652/console" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:06:14] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35650/console" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [12:07:32] (03PS4) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [12:08:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35653/console" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:09:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P29318 and previous config saved to /var/cache/conftool/dbconfig/20220601-120921-ladsgroup.json [12:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:19] (03PS5) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [12:11:02] (03CR) 10Majavah: [V: 03+1] "PCC 
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35654/console" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:12:02] (03PS1) 10Slyngshede: aptrepo::repo No - in nonfree, create missing deb-override file. [puppet] - 10https://gerrit.wikimedia.org/r/802126 [12:12:54] (03CR) 10CI reject: [V: 04-1] aptrepo::repo No - in nonfree, create missing deb-override file. [puppet] - 10https://gerrit.wikimedia.org/r/802126 (owner: 10Slyngshede) [12:14:49] (03PS2) 10Slyngshede: aptrepo::repo No - in nonfree, create missing deb-override file. [puppet] - 10https://gerrit.wikimedia.org/r/802126 [12:15:35] (03PS2) 10Giuseppe Lavagetto: rsyslog: do not use the same queue name for two logs [puppet] - 10https://gerrit.wikimedia.org/r/801628 [12:15:58] (03CR) 10Giuseppe Lavagetto: rsyslog: do not use the same queue name for two logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801628 (owner: 10Giuseppe Lavagetto) [12:16:46] (03CR) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [12:17:20] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10Evelien_WMDE) [12:19:08] (03PS1) 10Jbond: netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 [12:19:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/802126 (owner: 10Slyngshede) [12:19:33] kostajh: we have another one :D https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Thanks/+/802128 [12:20:53] (03CR) 10Slyngshede: [C: 03+2] aptrepo::repo No - in nonfree, create missing deb-override file. 
[puppet] - 10https://gerrit.wikimedia.org/r/802126 (owner: 10Slyngshede) [12:24:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T309617)', diff saved to https://phabricator.wikimedia.org/P29320 and previous config saved to /var/cache/conftool/dbconfig/20220601-122426-ladsgroup.json [12:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:33] T309617: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T309617 [12:24:38] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:28:48] (03PS7) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [12:29:10] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:29:15] (03PS8) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [12:32:24] RECOVERY - Check systemd state on netbox2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:04] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:34] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:37] (03CR) 10David Caro: [C: 03+1] "LGTM, will wait for o11y to ack on the main prometheus changes" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) 
(owner: 10Majavah) [12:33:59] (03CR) 10Klausman: [C: 03+1] role::pki::multirootca: add settings for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/801744 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [12:34:21] (03CR) 10Klausman: [C: 03+1] admin_ng: set cfssl-issuer's values for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/801766 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [12:34:56] (03PS1) 10Stang: zhwiki: Use wmgSiteLogoVariants to simplify logo variant settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802133 (https://phabricator.wikimedia.org/T308620) [12:35:40] (03CR) 10Stang: Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [12:36:00] (03CR) 10David Caro: [C: 03+2] P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - 10https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:37:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] rsyslog: do not use the same queue name for two logs [puppet] - 10https://gerrit.wikimedia.org/r/801628 (owner: 10Giuseppe Lavagetto) [12:38:19] (03CR) 10Klausman: [C: 03+1] Add BGP configuration for the new ML staging codfw cluster (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [12:38:44] (03CR) 10Klausman: [C: 03+1] ml-services: add ptwiki & ruwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/802097 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [12:40:18] (03Abandoned) 10Klausman: labs: Add dummy token for istio-cni on ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/775823 (owner: 10Klausman) [12:40:43] (03PS2) 10Giuseppe Lavagetto: mediawiki: disable revalidation everywhere [puppet] - 
10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) [12:40:45] (03PS1) 10Giuseppe Lavagetto: mediawiki: stop revalidating opcache on canaries [puppet] - 10https://gerrit.wikimedia.org/r/802134 [12:41:59] !log installing ruby-nokogiri security updates [12:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:46] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "This change should be deployed after the new scap version (4.8.1) is released." [puppet] - 10https://gerrit.wikimedia.org/r/802134 (owner: 10Giuseppe Lavagetto) [12:45:54] (03CR) 10David Caro: [C: 03+1] "I have not tested this, but I'm happy merging and seeing tweaking things later, let me know when/how you want to merge." [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [12:47:51] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Please note that here "everywhere" means on all appservers,api and parsoid servers." [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [12:49:30] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:51:52] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:58] (03PS3) 10Majavah: P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - 10https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716) [12:54:00] (03PS6) 10Majavah: metricsinfra: Use prometheus-configurator [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) [12:54:20] (03CR) 10David Caro: [C: 03+2] metricsinfra: Use prometheus-configurator [puppet] - 10https://gerrit.wikimedia.org/r/763664 
(https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [12:55:14] (03CR) 10Majavah: metricsinfra: Use prometheus-configurator (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:18] yay [13:11:49] (03CR) 10Elukey: [C: 03+2] role::pki::multirootca: add settings for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/801744 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [13:15:24] (03PS6) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [13:15:42] (03CR) 10Elukey: Add BGP configuration for the new ML staging codfw cluster (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [13:16:19] (03CR) 10Elukey: [C: 03+2] admin_ng: set cfssl-issuer's values for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/801766 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [13:18:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "only noticed https://phabricator.wikimedia.org/T305135 after the fact" [puppet] - 10https://gerrit.wikimedia.org/r/801829 (owner: 10Dzahn) [13:18:59] (03PS2) 10Jbond: netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 [13:19:42] (03CR) 10CI reject: [V: 04-1] netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 (owner: 10Jbond) [13:21:04] 10SRE, 10Wikimedia-Mailing-lists: Postorius (held and) 
reported full headers get mangled somewhere in the system - https://phabricator.wikimedia.org/T309492 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:21:22] (03CR) 10Klausman: [C: 03+2] ml-services: add ptwiki & ruwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/802097 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [13:24:24] (03Merged) 10jenkins-bot: ml-services: add ptwiki & ruwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/802097 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [13:32:06] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:35] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:35] !log aikochou@deploy1002 Started deploy [ores/deploy@3d541df]: Deploy revscoring 2.11.4 to ORES - T309536 [13:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:40] T309536: Deploy revscoring 2.11.4 to ORES - https://phabricator.wikimedia.org/T309536 [13:49:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 inline comment, rest LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 (owner: 10Ahmon Dancy) [13:53:42] (03PS3) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [13:53:46] (03PS1) 10JMeybohm: Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 [13:54:34] (03PS2) 10JMeybohm: Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 [13:55:23] 
(03CR) 10Alexandros Kosiaris: [C: 04-1] mwdebug service: Add traindev environment support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (owner: 10Ahmon Dancy) [13:55:28] (03CR) 10CI reject: [V: 04-1] Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 (owner: 10JMeybohm) [13:56:28] (03PS3) 10Jbond: netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 [13:57:10] (03CR) 10CI reject: [V: 04-1] netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 (owner: 10Jbond) [13:58:29] (03PS4) 10Jbond: netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 [13:59:20] (03PS1) 10Cathal Mooney: Change cloudgw configuration to use rack-specific GW IP [puppet] - 10https://gerrit.wikimedia.org/r/802140 (https://phabricator.wikimedia.org/T304989) [13:59:23] (03CR) 10CI reject: [V: 04-1] netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 (owner: 10Jbond) [13:59:24] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:00:23] (03PS5) 10Jbond: netbox: add proxy config to accounting report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 [14:01:32] (03CR) 10Jbond: "Ready for review" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802127 (owner: 10Jbond) [14:05:21] (03PS2) 10Cathal Mooney: Change cloudgw configuration to use rack-specific GW IP [puppet] - 10https://gerrit.wikimedia.org/r/802140 (https://phabricator.wikimedia.org/T304989) [14:05:53] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/801799 (https://phabricator.wikimedia.org/T286911) (owner: 10JHathaway) [14:08:33] 10SRE-swift-storage, 
10Discovery-Search (Current work): Create swift thanos account for Search platform team - https://phabricator.wikimedia.org/T309715 (10bking) [14:10:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1046.eqiad.wmnet with OS bullseye [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1046.eqiad.wmnet with OS bullseye [14:15:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:17:30] (03PS1) 10Majavah: kubeadm: drop support for 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/802143 [14:18:30] (03CR) 10David Caro: [C: 03+2] Change cloudgw configuration to use rack-specific GW IP [puppet] - 10https://gerrit.wikimedia.org/r/802140 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:22:10] (03PS1) 10Majavah: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - 10https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) [14:25:43] !log aikochou@deploy1002 Finished deploy [ores/deploy@3d541df]: Deploy revscoring 2.11.4 to ORES - T309536 (duration: 45m 07s) [14:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:46] T309536: Deploy revscoring 2.11.4 to ORES - https://phabricator.wikimedia.org/T309536 [14:25:48] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:28:19] !log 
mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1046.eqiad.wmnet with reason: host reimage [14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:01] jouncebot: nowandnext [14:30:02] No deployments scheduled for the next 3 hour(s) and 29 minute(s) [14:30:02] In 3 hour(s) and 29 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1800) [14:30:02] In 3 hour(s) and 29 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1800) [14:30:13] (03PS1) 10Ladsgroup: Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/802109 (https://phabricator.wikimedia.org/T306636) [14:31:00] (03PS1) 10Ladsgroup: Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802110 (https://phabricator.wikimedia.org/T306636) [14:31:12] (03CR) 10Ladsgroup: [C: 03+2] Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/802109 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [14:31:18] (03CR) 10Ladsgroup: [C: 03+2] Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802110 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [14:31:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1046.eqiad.wmnet with reason: host reimage [14:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:46] (03PS1) 10Cwhite: opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) [14:41:35] (03PS2) 10Jbond: check_netbox_report: add url to output 
[puppet] - 10https://gerrit.wikimedia.org/r/802075 [14:42:15] (03PS2) 10Cwhite: opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) [14:42:27] (03PS1) 10Jelto: gitlab: make gitlab1004 new production instance [puppet] - 10https://gerrit.wikimedia.org/r/802150 (https://phabricator.wikimedia.org/T307142) [14:45:15] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/802091 (owner: 10Majavah) [14:45:41] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35655/console" [puppet] - 10https://gerrit.wikimedia.org/r/802150 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [14:45:58] (03CR) 10CI reject: [V: 04-1] Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802110 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [14:46:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802098 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [14:46:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1046.eqiad.wmnet with OS bullseye [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:57] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1046.eqiad.wmnet with OS bullseye completed: - ms-be1046 (**PASS**) - Downtim... 
[14:47:32] (03PS3) 10Cwhite: opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) [14:47:51] (03Abandoned) 10Jbond: C:netbox: add http proxy config to uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/802105 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:48:54] 10SRE-swift-storage, 10Discovery-Search (Current work): Create swift thanos account for Search platform team - https://phabricator.wikimedia.org/T309715 (10bking) [14:49:24] (03Merged) 10jenkins-bot: Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/802109 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [14:49:30] (03Merged) 10jenkins-bot: Don't call saveOptions in Hooks::onAccountCreated [extensions/Thanks] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802110 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [14:50:23] (03PS1) 10Cwhite: opensearch_dashboards: add and enable bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) [14:51:53] (03CR) 10Jcrespo: "In terms of bacula needed code, this is as good as it gets." 
[puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [14:52:00] (03CR) 10Jbond: [C: 03+1] wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [14:53:01] (03PS2) 10BryanDavis: developer-portal: Bump container version to 2022-06-01-143800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/799416 [14:53:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:44] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/Thanks/includes/Hooks.php: Backport: [[gerrit:802109|Don't call saveOptions in Hooks::onAccountCreated (T306636)]] (duration: 03m 10s) [14:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:47] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636 [14:59:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [15:00:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:16] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.14/extensions/Thanks/includes/Hooks.php: Backport: [[gerrit:802110|Don't call saveOptions in Hooks::onAccountCreated (T306636)]] (duration: 03m 10s) [15:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:20] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636 [15:07:40] (03CR) 10Jcrespo: "See suggestions. 
0:-)" [puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:09:52] (03CR) 10Jcrespo: [C: 03+1] opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:11:33] (03CR) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [15:12:18] (03PS2) 10Cwhite: opensearch_dashboards: add and enable bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) [15:12:24] (03CR) 10Cwhite: opensearch_dashboards: add and enable bacula backups (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:12:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801704 (owner: 10David Caro) [15:13:33] (03PS4) 10Cwhite: opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) [15:13:58] (03CR) 10Jcrespo: [C: 03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:14:46] (03CR) 10David Caro: [C: 03+2] "yep, last usage: If173dc3104a236c6d864fe167cd7594b048d4340" [puppet] - 10https://gerrit.wikimedia.org/r/802099 (owner: 10Majavah) [15:14:55] (03CR) 10Jcrespo: [C: 03+1] opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:15:07] (03CR) 10Cwhite: [C: 03+2] opensearch_dashboards: copy latest backup to a predictable name [puppet] - 10https://gerrit.wikimedia.org/r/802149 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:15:51] (03PS3) 10Cwhite: opensearch_dashboards: add and enable bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) [15:16:02] (03CR) 10Cwhite: [C: 03+2] opensearch_dashboards: add and enable bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/802151 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [15:17:26] taavi: we have a lot of NRPE checks reporting unknown status on icinga.. could it be related to ead3c7e3dc? 
[15:17:29] (03CR) 10David Caro: [C: 03+2] codfw1dev,wmcs: Add labtest/wmcs-roots to the admin groups [puppet] - 10https://gerrit.wikimedia.org/r/801704 (owner: 10David Caro) [15:20:00] godog: looking [15:20:06] vgutierrez: looking [15:20:10] godog: ignore :) [15:24:07] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1047.eqiad.wmnet with OS bullseye [15:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:10] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1047.eqiad.wmnet with OS bullseye [15:26:01] (03PS1) 10Jbond: C:redis: update check location [puppet] - 10https://gerrit.wikimedia.org/r/802155 [15:26:33] (03CR) 10Jbond: [C: 03+2] C:redis: update check location [puppet] - 10https://gerrit.wikimedia.org/r/802155 (owner: 10Jbond) [15:26:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:redis: update check location [puppet] - 10https://gerrit.wikimedia.org/r/802155 (owner: 10Jbond) [15:27:05] (03PS7) 10David Caro: wmcs: Added taskircmail, ircmail and pageircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [15:27:07] (03PS7) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [15:30:19] the bacula monitoring stack broke [15:31:46] (03CR) 10Jcrespo: "This broke bacula monitoring." 
[puppet] - 10https://gerrit.wikimedia.org/r/801665 (owner: 10Majavah) [15:32:10] jynus: can you be more specific? [15:32:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/801665/2/modules/profile/manifests/backup/director.pp#213 [15:33:35] check_bacula.py got moved to an nrpe plugin path, but this breaks prometheus monitoring and the command line interface [15:33:46] (03PS1) 10David Caro: network.tests:Use correct object for site [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802158 [15:33:49] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:34:15] (03PS1) 10Jbond: Revert "move more nrpe checks to nrpe::plugin and sudo_user" [puppet] - 10https://gerrit.wikimedia.org/r/802113 [15:34:30] not everything needs to be reverted [15:34:34] only the bacula part [15:34:51] jynus: ack, that's what I'm doing [15:34:58] how did that get merged without adding me as a reviewer? [15:35:03] I see the value of that patch [15:35:22] but it cannot be done without refactoring prometheus and command line [15:35:44] (03PS2) 10Jbond: Revert "move more nrpe checks to nrpe::plugin and sudo_user" [puppet] - 10https://gerrit.wikimedia.org/r/802113 [15:36:53] (03CR) 10CI reject: [V: 04-1] network.tests:Use correct object for site [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802158 (owner: 10David Caro) [15:37:59] jynus: taavi has been doing lots of these; in 99% of cases it's a very simple change, so I have been shepherding them. However, a couple like this one have had some subtleties that were missed [15:38:21] jynus: fyi https://gerrit.wikimedia.org/r/c/operations/puppet/+/802113 [15:39:28] jbond: can you add me as a reviewer? 
[15:40:05] done [15:40:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1047.eqiad.wmnet with reason: host reimage [15:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:54] so the context is that it is a "regular executable"; it is not a nagios plugin [15:41:20] it just happens to be used by one, but it does a lot of other things, like exporting prometheus metrics [15:41:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [15:41:39] sorry about that jynus! and thanks for the fix jbond [15:41:54] I am very open to any suggestion on how to improve it [15:42:10] jynus: ack, we may need to update nrpe::plugin to support that; I'm not sure how flexible it is offhand without looking [15:42:21] but I would like to be involved [15:42:31] (I usually take very little time to do a review) [15:42:39] and documentation should be more or less up to date: https://wikitech.wikimedia.org/wiki/Check_bacula.py [15:42:54] ^ "Check bacula is a wrapper for bconsole that is able to produce output to be used by icinga and prometheus, for Bacula monitoring in the WMF production." [15:43:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1047.eqiad.wmnet with reason: host reimage [15:43:36] jynus: ack, like I said, for 99% of cases this is just a puppet refactor, which is why I pushed it through without subject-matter-expert review; however, I should have noticed that the bin was in /usr/bin and not usr/lib/nagios... 
[15:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:38] it has an "--icinga" mode, but also "--prometheus", and I use it regularly as a shorthand for many backup monitoring tasks [15:44:00] (03PS3) 10David Caro: Fix spelling errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801730 [15:44:02] (03PS4) 10David Caro: wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 [15:44:04] e.g. there is a systemctl service that depends on that [15:44:04] (03PS3) 10David Caro: Add readme, configure script and missing modules [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 [15:44:12] either way, can I get a +1 on the revert to fix the current issue? [15:44:29] yep, sorry, was trying to explain, will get you one quickly [15:45:11] thanks, and I completely understand that this one should have had additional review; it just slipped through and I 100% own that :) [15:45:39] don't worry, I was just worried at first [15:45:51] as long as it is only monitoring that is affected, I will be happy! 
[15:45:55] (03PS1) 10Elukey: profile::kubernetes: remove ml-staging specific bits [labs/private] - 10https://gerrit.wikimedia.org/r/802159 (https://phabricator.wikimedia.org/T302195) [15:46:06] (only monitoring vs actual backups, for example) [15:46:10] ack thanks [15:47:57] mm, I am unsure about the revert [15:48:26] (03CR) 10CI reject: [V: 04-1] Fix spelling errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801730 (owner: 10David Caro) [15:48:58] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [15:49:09] I am comparing, and I'm not sure if it is a rebase issue, but I am seeing differences with the original status [15:49:21] maybe I am comparing against the wrong patch [15:49:38] (03CR) 10CI reject: [V: 04-1] wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 (owner: 10David Caro) [15:49:40] (03CR) 10CI reject: [V: 04-1] Add readme, configure script and missing modules [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 (owner: 10David Caro) [15:49:50] jynus: the original is https://gerrit.wikimedia.org/r/c/operations/puppet/+/802113/2/modules/profile/manifests/backup/director.pp [15:50:02] what is the specific issue you are seeing? 
[15:50:12] feel free to mark on either the original or the revert [15:51:15] (03CR) 10Jbond: [C: 03+1] sre.swift.convert-ssds: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [15:52:03] (03CR) 10Jcrespo: [C: 03+1] Revert "move more nrpe checks to nrpe::plugin and sudo_user" [puppet] - 10https://gerrit.wikimedia.org/r/802113 (owner: 10Jbond) [15:52:08] let's just merge [15:52:24] (03CR) 10Jbond: [C: 03+2] Revert "move more nrpe checks to nrpe::plugin and sudo_user" [puppet] - 10https://gerrit.wikimedia.org/r/802113 (owner: 10Jbond) [15:52:44] and I will restart prometheus to make sure things are working [15:52:49] jynus: merging but more then happy to look at any issues you see [15:52:54] *prometheus exporter [15:52:58] * jbond merged [15:53:52] and again, I am not saying that is untouchable, I am happy to receive suggestions, there is surely a lot to improve on how that is done [15:56:06] (03PS1) 10David Caro: sre: fix some lint errors [cookbooks] - 10https://gerrit.wikimedia.org/r/802164 [15:56:36] (03CR) 10David Caro: "This is blocking some patches on wmcs branch from passing the tests." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/802164 (owner: 10David Caro) [15:57:15] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [15:57:22] (03PS1) 10Jbond: P:backup::director: use new sudo_user parameter for nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/802165 [15:57:49] jynus: taavi: I think ^^^^ https://gerrit.wikimedia.org/r/802165 is what is needed to convert this check to use the new sudo_user functionality [15:57:51] 10SRE, 10Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (10MoritzMuehlenhoff) [15:57:55] but I need to jump into a meeting now [15:57:57] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:58:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1047.eqiad.wmnet with OS bullseye [15:58:10] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1047.eqiad.wmnet with OS bullseye completed: - ms-be1047 (**PASS**) - Downtim... [15:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:26] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10BTullis) I like the look of this task, so I'm going to claim it if noone minds. Predictably enough, I think that we should use MirrorMaker 2 and run it... 
[15:58:58] jbond: that seems more reasonable to me, but let me get back to you, as I got interrupted by this issue in the middle of a new backup setup [15:59:13] (03CR) 10Klausman: [C: 03+1] profile::kubernetes: remove ml-staging specific bits [labs/private] - 10https://gerrit.wikimedia.org/r/802159 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [16:00:38] jynus: ack, sure, no rush :) [16:02:41] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10hashar) A few people poked me about that instance. Looks like th... [16:07:03] (03CR) 10Majavah: [C: 03+1] "looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/802165 (owner: 10Jbond) [16:13:42] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me, and innocent enough that I can check it easily after deploy (check command works as expected, prometheus keeps collectin" [puppet] - 10https://gerrit.wikimedia.org/r/802165 (owner: 10Jbond) [16:14:30] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::kubernetes: remove ml-staging specific bits [labs/private] - 10https://gerrit.wikimedia.org/r/802159 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [16:17:26] the good news: prometheus wasn't restarted, so we didn't lose the metrics for grafana [16:23:24] (03PS1) 10Btullis: Reduce the retntion time for hadoop namenode fsimages [puppet] - 10https://gerrit.wikimedia.org/r/802169 (https://phabricator.wikimedia.org/T309649) [16:23:42] (03PS2) 10Btullis: Reduce the retention time for hadoop namenode fsimages [puppet] - 10https://gerrit.wikimedia.org/r/802169 (https://phabricator.wikimedia.org/T309649) [16:26:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35657/console" [puppet] - 10https://gerrit.wikimedia.org/r/802169 
(https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [16:28:08] (03PS1) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 [16:28:25] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10hashar) a:05thcipriani→03hashar I am not entirely sure what `contint-admins` group grants but I will review it. The main concern I have is the CI stack is very fragile :-\ [16:31:58] (03CR) 10CI reject: [V: 04-1] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [16:38:01] (03CR) 10David Caro: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [16:42:07] (03CR) 10Btullis: [V: 03+1 C: 03+2] Reduce the retention time for hadoop namenode fsimages [puppet] - 10https://gerrit.wikimedia.org/r/802169 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [16:43:10] (03CR) 10Ottomata: "TY!" [puppet] - 10https://gerrit.wikimedia.org/r/802169 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [16:46:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:52:54] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Dzahn) @hashar This is the, pretty specific (good thing!), list of things that contint-admins grants: ` privileges: ['ALL = (jenkins) NOPASSWD: ALL', 'ALL = (jenkins... 
[17:09:02] (03PS1) 10Ladsgroup: Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated [extensions/Wikibase] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802114 (https://phabricator.wikimedia.org/T306636) [17:09:13] jouncebot: nowandnext [17:09:13] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [17:09:14] In 0 hour(s) and 50 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1800) [17:09:14] In 0 hour(s) and 50 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1800) [17:09:26] (03CR) 10Ladsgroup: [C: 03+2] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated [extensions/Wikibase] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802114 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [17:14:17] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:27:59] (03CR) 10CI reject: [V: 04-1] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated [extensions/Wikibase] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802114 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [17:29:29] boooo CI [17:31:30] (03PS2) 10Ladsgroup: Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated [extensions/Wikibase] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802114 (https://phabricator.wikimedia.org/T306636) [17:31:36] (03CR) 10Ladsgroup: [C: 03+2] Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated [extensions/Wikibase] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802114 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [17:31:45] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring 
project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [17:53:11] (03Merged) 10jenkins-bot: Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated [extensions/Wikibase] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802114 (https://phabricator.wikimedia.org/T306636) (owner: 10Ladsgroup) [17:58:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:18] (03PS2) 10Muehlenhoff: puppetdb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801636 (https://phabricator.wikimedia.org/T308013) [17:59:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:59:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] jeena and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1800). [18:00:05] jeena and dancy: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T1800). 
[18:00:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:35] 10SRE, 10Machine-Learning-Team, 10ORES: Stress test ORES on kubernetes (above 4.5k scores/second) - https://phabricator.wikimedia.org/T214054 (10Krinkle) [18:03:29] (03PS1) 10Jeena Huneidi: group1 wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802173 [18:03:31] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802173 (owner: 10Jeena Huneidi) [18:03:52] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801636 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:04:15] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802173 (owner: 10Jeena Huneidi) [18:04:38] (03PS2) 10Muehlenhoff: karapace: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801637 (https://phabricator.wikimedia.org/T308013) [18:04:49] oops Amir1 are you still deploying something? [18:05:08] it should have been done soon-ish [18:05:10] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.14/extensions/Wikibase/client: Backport: [[gerrit:802114|Don't call saveSettings in EchoNotificationsHandlers::doLocalUserCreated (T306636)]] (duration: 03m 11s) [18:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:14] done now [18:05:14] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636 [18:05:30] okay [18:05:32] all good. Sorry I stepped on your toes. 
It took way more than expected [18:05:46] that's alright, I didn't check far enough in the backscroll [18:05:52] (03CR) 10Muehlenhoff: [C: 03+2] karapace: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801637 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:07:13] (03CR) 10Muehlenhoff: [C: 03+2] matomo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801640 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:07:19] (03PS3) 10Muehlenhoff: matomo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801640 (https://phabricator.wikimedia.org/T308013) [18:10:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:11:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:39] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.14 refs T308067 [18:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:42] T308067: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 [18:12:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:44] (03PS1) 10Muehlenhoff: Remove superflous comment [puppet] - 10https://gerrit.wikimedia.org/r/802174 [18:14:41] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.14 refs T308067 (duration: 03m 02s) [18:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:16:14] (03CR) 10Muehlenhoff: [C: 03+2] Remove old buster idp-test hosts, add new idp/bullseye ones from Ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/802098 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [18:16:16] (03CR) 10Dzahn: [C: 03+1] Remove superflous comment [puppet] - 10https://gerrit.wikimedia.org/r/802174 (owner: 10Muehlenhoff) [18:19:16] (03PS1) 10Muehlenhoff: Trim comment [puppet] - 10https://gerrit.wikimedia.org/r/802175 [18:19:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802174 (owner: 10Muehlenhoff) [18:21:01] (03CR) 10Volans: [C: 03+2] "LGTM, thanks for the fixes!" [cookbooks] - 10https://gerrit.wikimedia.org/r/802164 (owner: 10David Caro) [18:24:32] (03Merged) 10jenkins-bot: sre: fix some lint errors [cookbooks] - 10https://gerrit.wikimedia.org/r/802164 (owner: 10David Caro) [18:25:58] (03PS1) 10Muehlenhoff: Record extended MOU for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/802176 [18:28:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 43.08 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [18:30:19] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [18:30:30] (03CR) 10Muehlenhoff: [C: 03+2] Record extended MOU for effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/802176 (owner: 10Muehlenhoff) [18:36:51] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802150 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [18:37:50] 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH) 05Open→03Resolved [18:38:11] (03PS1) 10Volans: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 [18:38:13] (03PS1) 10Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [18:39:35] (03CR) 10Volans: "As I mentioned in Icc65dc8961983b2e40638abf19beda7363f20a57 this was some previous work that I did to refactor the script in order to be a" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [18:41:39] (03CR) 10Volans: "And this is the group support part, that I finished today. 
I think it works (from my tests on netbox-next) but it would need some more tes" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [18:42:43] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/802175 (owner: 10Muehlenhoff) [18:43:30] (03CR) 10Volans: "To test it you can run on netbox-dev2002 as root:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [18:52:00] !log About to deploy analytics/refinery (weekly deployment train) [18:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:08] !log ebysans@deploy1002 Started deploy [analytics/refinery@13f791b]: Regular analytics weekly train [analytics/refinery@13f791b] [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Dereckson) [19:12:00] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10KFrancis) I have the info I need for the NDA and will put the document together. Please look for a message from DocuSign soon to complete the NDA. Thanks! [19:16:35] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:36] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10Ottomata) > Predictably enough, I think that we should use MirrorMaker 2 and run it in k8s on the wikikube clusters :-) This would be awesome. I'd be r... 
[19:19:20] !log ebysans@deploy1002 Finished deploy [analytics/refinery@13f791b]: Regular analytics weekly train [analytics/refinery@13f791b] (duration: 23m 12s) [19:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:59] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:33:48] (03PS1) 10Muehlenhoff: Add Dereckson to contributors [puppet] - 10https://gerrit.wikimedia.org/r/802191 [19:35:17] (03CR) 10Muehlenhoff: [C: 03+2] Add Dereckson to contributors [puppet] - 10https://gerrit.wikimedia.org/r/802191 (owner: 10Muehlenhoff) [19:35:24] !log ebysans@deploy1002 Started deploy [analytics/refinery@13f791b] (thin): Regular analytics weekly train THIN [analytics/refinery@13f791b] [19:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:31] !log ebysans@deploy1002 Finished deploy [analytics/refinery@13f791b] (thin): Regular analytics weekly train THIN [analytics/refinery@13f791b] (duration: 00m 07s) [19:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:53] !log ebysans@deploy1002 Started deploy [analytics/refinery@13f791b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@13f791b] [19:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:23] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) >>! In T304373#7974286, @BTullis wrote: > I like the look of this task, so I'm going to claim it if noone minds. Please go right ahead! I am h... 
[19:42:58] !log ebysans@deploy1002 Finished deploy [analytics/refinery@13f791b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@13f791b] (duration: 07m 06s) [19:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:01] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:48:07] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220601T2000). [20:00:05] zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10Jclark-ctr) backup1009 E3 U1 cableid 20220271 : port 1 [20:01:56] o/ [20:02:21] zabe: if you're around, I'm wondering about your patch [20:02:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:03:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.744 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:51] cjming, 
hey, I am here now, sorry for being late [20:05:34] no worries [20:05:57] zabe: is this patch a re-do of the one that caused fatals some days ago? [20:06:18] no [20:06:43] oh good - basically i have no idea if it should be merged [20:07:07] The one you had to revert I was able to re-do last Thursday [20:07:19] gtk [20:07:39] (03PS2) 10Ahmon Dancy: mediawiki 0.2.1: Add a helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 [20:08:25] (03PS2) 10Clare Ming: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800278 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:08:57] (03CR) 10Andrew Bogott: [C: 03+2] developer-portal: Bump container version to 2022-06-01-143800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/799416 (owner: 10BryanDavis) [20:09:11] (03CR) 10Ahmon Dancy: mediawiki 0.2.1: Add a helm test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 (owner: 10Ahmon Dancy) [20:09:26] !log Successfully deployed refinery using scap, then deployed onto hdfs. 
[20:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:34] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.1 - https://phabricator.wikimedia.org/T309116 (10dancy) [20:12:36] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-06-01-143800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/799416 (owner: 10BryanDavis) [20:13:42] (03PS1) 10Andrew Bogott: Toolhub: Prepare to deploy 2022-05-30-111657-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/802194 [20:13:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [20:13:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [20:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298560)', diff saved to https://phabricator.wikimedia.org/P29323 and previous config saved to /var/cache/conftool/dbconfig/20220601-201402-ladsgroup.json [20:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:05] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [20:15:24] !log andrew@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [20:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:32] !log andrew@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [20:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:06] zabe: being a relatively green deployer, I'm not sure sometimes what to do - I'm inclined to think it will be 
fine and we can revert if things go awry [20:16:40] is there a way to test on the debug server? [20:17:07] (03CR) 10Ahmon Dancy: mediawiki 0.2.1: Add a helm test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 (owner: 10Ahmon Dancy) [20:17:10] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:18:07] yes, that should be fine. There is nothing to test. We are enabling writing to a new column of the cu_changes table. Writing to this has been enabled on the testwikis for 2 weeks now and all s3 wikis for like half a week with no more errors. So yes, if errors should show up (which is unlikely), just revert. [20:18:23] alrighty then [20:18:30] (03CR) 10Clare Ming: [C: 03+2] Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800278 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:18:33] (03CR) 10Andrew Bogott: [C: 03+2] Toolhub: Prepare to deploy 2022-05-30-111657-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/802194 (owner: 10Andrew Bogott) [20:19:24] !log andrew@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [20:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:42] cjming: zabe please double check it on debug server, those kinds of changes can cause data corruption in theory (by writing data that's not consistent).
[20:19:45] !log andrew@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [20:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:51] just a fyi :) [20:19:53] (03Merged) 10jenkins-bot: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800278 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:20:31] !log andrew@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [20:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:06] thanks urbanecm -- zabe: it should be on mwdebug1001 [20:21:11] !log andrew@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [20:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:18] !log andrew@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [20:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:22] ok, i'll take a look [20:21:47] !log andrew@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [20:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:58] (03Merged) 10jenkins-bot: Toolhub: Prepare to deploy 2022-05-30-111657-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/802194 (owner: 10Andrew Bogott) [20:23:03] cjming, editing works without errors and logstash looks clear [20:23:17] great - then syncing [20:23:52] shouldn't we check cuc_actor too? [20:24:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:03] just in case [20:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:12] if you want? [20:24:24] I made a test edit in dewiki [20:24:43] zabe: mind linking? :) [20:24:52] urbanecm: not sure how to test what you're asking about? 
[20:25:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:25:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:06] https://de.wikipedia.org/w/index.php?title=Benutzer:Zabe/Test&diff=prev&oldid=223359043&diffmode=source [20:25:23] cjming: checking whether the DB has intended content. it might not cause any errors, but it can add nonsense stuff to the DB in theory :) [20:25:28] looking [20:26:32] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800278|Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 01s) [20:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:36] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:26:44] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:26:50] it looks to work (and it also is synced, so all good :)) [20:27:21] 😌 [20:28:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:06] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:08] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:46] !log andrew@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [20:31:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:25] urbanecm: thanks for your gut check [20:32:31] !log andrew@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [20:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:42] shutting it down early [20:32:54] !log end of UTC late backport window [20:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:07] no problem. i generally try to be careful with this kind of things 🙂 [20:34:21] seems like a wise rule of thumb [20:35:05] !log andrew@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [20:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:36:17] !log andrew@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [20:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:45] !log andrew@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [20:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:04] !log andrew@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [20:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:03] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Southparkfan) [20:50:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:50:42] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10RobH) this host is showing an error on https://netbox.wikimedia.org/extras/reports/results/3198630/ so I've set it to failed in netbox to fix the report [20:53:22] CannotCreateActorException [20:53:23] bah [20:56:09] zabe: do we revert? [20:56:46] urbanecm, yeah, let's revert [20:57:23] (03PS1) 10Urbanecm: Revert "Start writing to cuc_actor everywhere except s4 and s8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802115 (https://phabricator.wikimedia.org/T233004) [20:57:32] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "reverting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802115 (https://phabricator.wikimedia.org/T233004) (owner: 10Urbanecm) [20:57:43] Why is Abusefilter sending stuff to cu_changes with non-existing users ... [20:58:13] not sure [20:58:35] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5a8e7586bcb0933c96e8294e389c31270edb134e: Revert "Start writing to cuc_actor everywhere except s4 and s8" (T233004) (duration: 00m 32s) [20:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:40] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:58:47] zabe: reverted in prod now [20:59:05] can i help with anything else? [20:59:14] no, thanks :) [20:59:19] okay, great [20:59:28] I have made this task https://phabricator.wikimedia.org/T309737, is it related to the reverted patch zabe urbanecm ? 
[20:59:49] i think so [20:59:58] yes [21:00:44] hmm, maybe I should go ahead and close it then [21:01:31] * urbanecm just closed it [21:01:44] 🙂 thanks [21:02:04] (03CR) 10Ahmon Dancy: mwdebug service: Add traindev environment support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (owner: 10Ahmon Dancy) [21:03:16] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:04:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:18] 22:57 Why is Abusefilter sending stuff to cu_changes with non-existing users ... <== actually, i think i know why. AF can prevent accounts from being created. maybe that's what happened? [21:05:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:05:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:46] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:52] answering myself: yes, it is (`18:47, 1 June 2022: Qwqqwqq (talk | block) triggered filter 451, performing the action "autocreateaccount" on Speciale:Entra. Actions taken: Disallow; Filter description: Prevenzione NUI casuali (details | examine)`) [21:05:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:44] urbanecm, yeah, seems so. It just bugged me a little bit because I don't know what to do now. 
There is no problem in the CheckUser code per se. [21:08:31] gah - just catching up - thanks zabe, urbanecm for taking care of things [21:09:22] no problem [21:10:22] zabe: so, before the actor config change was made, CU added rows that are like `cuc_user=0;cuc_user_text='Foobar'`. perhaps let's do the equivalent of that in the actor world? [21:10:32] like, add an actor with actor_name='Foobar' and actor_user=null [21:10:33] !log restart wdqs-blazegraph on wdqs1007 to resolve BlazegraphFreeAllocatorsDecreasingRapidly [21:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:35] The idea itself sounds fair, but it no longer works, considering that the long-term goal is to drop cuc_user and cuc_user_text. [21:14:57] zabe: yeah, i know. i meant to create a new actor, with actor_name='Foobar', actor_user=null and use that in cuc_actor [21:15:36] that should be still possible even when cuc_user/cuc_user_text are gone? but perhaps i'm missing something about how actors work [21:16:31] i see. I am fairly certain that the current actor system does not allow actor_user=null for anything other than IPs. [21:17:09] https://gerrit.wikimedia.org/g/mediawiki/core/+/c830404d1d92a97c5f8d8a4766f6117dfff6d2ab/includes/user/ActorStore.php#612 [21:17:29] ^ there is this ActorStore::validateActorForInsertion function which prevents insertions like this [21:17:50] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder) [21:18:25] zabe: I said that inspired by https://phabricator.wikimedia.org/P29324 (how xwiki blocks are inserted). not sure how those rows get inserted, but apparently they're a thing :-) [21:19:47] AFAIK importing a page via special:import and leaving "Assign edits to local users where the named user exists locally" unchecked will result in something similar (username in a form of `interwiki>user`) [21:19:49] ok, I need to correct myself.
the actor system does not allow actor_user=null when the username is usable. (which is not the case for the external usernames due to the ">") [21:20:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:20:06] i see [21:21:24] And allowing insertions like that could lead to problems, because someone else could later create an account with that name, leading to two entries in the actor table with the same actor_name value. [21:21:27] (03CR) 10Krinkle: Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [21:21:33] good point [21:22:17] * urbanecm is thinking about using `invalid>Foobar` or `abusefilter>Foobar`, but those sound like workarounds [21:22:22] (03CR) 10Krinkle: zhwiki: Use wmgSiteLogoVariants to simplify logo variant settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802133 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [21:22:49] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder) [21:25:35] I probably need to talk with the AbuseFilter devs (I guess Daimona) about this. [21:25:57] yeah, sounds like a good idea. thanks for working on the migration zabe! 
[21:26:44] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:30:01] np [21:30:14] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:31:20] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:38] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: restart to enable S3 plugin - bking@cumin1001 - T309720 [21:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:43] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [21:33:58] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: restart to enable S3 plugin - bking@cumin1001 - T309720 [21:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:44] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:58] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:59:08] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_data_nodes: 6, number_of_in_flight_fetch: 0, number_of_pending_tasks: 1, delayed_unassigned_shards: 0, number_of_nodes: 6, active_shards_percent_as_number: 99.07955292570676, active_primary_shards: 758, active_shards: 1507, initializing_shards: 10, relocating_shards: 0, cluster_name: cloudelastic- [21:59:08] d, task_max_waiting_in_queue_millis: 0, timed_out: False, status: red, unassigned_shards: 4 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:04:26] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:06:08] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:06:58] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:08:18] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, initializing_shards: 5, timed_out: False, number_of_in_flight_fetch: 0, status: red, unassigned_shards: 2, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 99.5397764628534, active_primary_shards: 758, active_shards: 1514 [22:08:18] _of_nodes: 6, number_of_data_nodes: 6, relocating_shards: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:13:23] !log T309720 Downtimed cloudelastic until Monday while we perform maintenance across the next couple days (will manually lift downtime later) [22:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:28] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [22:16:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:19:42] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:58:59] (CR) Dduvall: "We now have a package for jwt-authorizer and this is ready for review." 
[puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall) [22:59:03] (CR) CI reject: [V: -1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall) [23:01:43] (PS1) Bartosz Dziewoński: Enable DiscussionTools automatic topic subscriptions as beta feature on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802214 (https://phabricator.wikimedia.org/T295425) [23:05:10] (CR) Aaron Schulz: [C: +1] Add the master from the primary DC to the secondary DC load arrays (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling) [23:06:42] (PS5) Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) [23:07:34] (CR) CI reject: [V: -1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall) [23:09:19] (PS2) Bartosz Dziewoński: Enable DiscussionTools automatic topic subscriptions as beta feature on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802214 (https://phabricator.wikimedia.org/T295425) [23:09:27] (PS2) Bartosz Dziewoński: Launch DiscussionTools topic subscriptions a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/801818 (https://phabricator.wikimedia.org/T304029) [23:21:13] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:22:40] (PS6) Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/793875 
(https://phabricator.wikimedia.org/T308501) [23:23:36] (CR) CI reject: [V: -1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall) [23:33:01] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook