[00:00:40] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:00] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:09:10] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:45:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:54:02] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (lmata) p:Triage→Medium sgtm
[02:05:54] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[02:10:26] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:19:48] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:28] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:25:20] <wikibugs>	 (PS2) JHathaway: Move vendored modules to vendor_modules [puppet] - https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423)
[03:27:14] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-For-Review: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (jhathaway) There seems to be some coalescing around moving vendored modules into their own directory, here is a patch that does just that, feedback very much appreciated,...
[03:59:30] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) For this round of reimaging I'm happy to just edit the options while reimaging, but  - I'll want to do this myself so I do...
[05:39:55] <wikibugs>	 SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (odimitrijevic)
[05:45:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[06:05:54] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[06:05:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:05:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:05:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:12:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:38] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:18:41] <wikibugs>	 (PS1) Marostegui: dbproxy1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/770315
[06:19:51] <wikibugs>	 (PS1) Marostegui: Revert "wmnet: Switchover m2-master" [dns] - https://gerrit.wikimedia.org/r/770052
[06:19:58] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/770315 (owner: Marostegui)
[06:20:12] <wikibugs>	 (PS2) Marostegui: Revert "wmnet: Switchover m2-master" [dns] - https://gerrit.wikimedia.org/r/770052
[06:22:51] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy1013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/770053
[06:24:42] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m3-master [dns] - https://gerrit.wikimedia.org/r/770316
[06:25:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:25:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 12 hosts with reason: Maintenance
[06:25:20] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m3-master [dns] - https://gerrit.wikimedia.org/r/770316 (owner: Marostegui)
[06:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 12 hosts with reason: Maintenance
[06:25:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:47] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy1013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/770053 (owner: Marostegui)
[06:30:28] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "wmnet: Switchover m2-master" [dns] - https://gerrit.wikimedia.org/r/770052 (owner: Marostegui)
[06:30:32] <wikibugs>	 (PS3) Marostegui: Revert "wmnet: Switchover m2-master" [dns] - https://gerrit.wikimedia.org/r/770052
[06:35:02] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:44:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:44:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:44:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220313T0800)
[07:00:05] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:22] <taavi>	 aren't these usually an hour later?
[07:01:02] * taavi can't guarantee being here this early
[07:02:22] <apergos>	 I only know about the thursday ones
[07:02:25] <apergos>	 those are an hour later
[07:02:57] <taavi>	 those are usually the same time as the normal windows
[07:03:14] <apergos>	 oh
[07:03:19] <taavi>	 maybe some DST mess?
[07:03:21] <apergos>	 it's daylight something-or-other
[07:03:30] <apergos>	 where one part of the world switched and not the other
[07:03:33] <apergos>	 sigh
[07:03:49] <apergos>	 yeah because my thursday one is set for 9 am also and that's early 
[07:03:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[07:03:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[07:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298294)', diff saved to https://phabricator.wikimedia.org/P22375 and previous config saved to /var/cache/conftool/dbconfig/20220314-070404-marostegui.json
[07:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:08] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[07:07:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:07:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:07:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298563)', diff saved to https://phabricator.wikimedia.org/P22376 and previous config saved to /var/cache/conftool/dbconfig/20220314-070721-marostegui.json
[07:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:25] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[07:11:39] <marostegui>	 !log dbmaint on s7@eqiad T300775
[07:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:45] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[07:12:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance
[07:12:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance
[07:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance
[07:12:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance
[07:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:51] <elukey>	 !log restart varnishkafka-webrequest on cp6001 to test a metric issue
[07:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:20] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:24:32] <wikibugs>	 (PS1) Elukey: Set bullseye and overlayfs for kubernetes2017 [puppet] - https://gerrit.wikimedia.org/r/770439 (https://phabricator.wikimedia.org/T300744)
[07:24:34] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes1007 [puppet] - https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744)
[07:33:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298294)', diff saved to https://phabricator.wikimedia.org/P22377 and previous config saved to /var/cache/conftool/dbconfig/20220314-073313-marostegui.json
[07:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:17] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[07:33:38] <icinga-wm>	 PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[07:36:52] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:39:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[07:40:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond)
[07:43:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298563)', diff saved to https://phabricator.wikimedia.org/P22378 and previous config saved to /var/cache/conftool/dbconfig/20220314-074323-marostegui.json
[07:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:28] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[07:48:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22379 and previous config saved to /var/cache/conftool/dbconfig/20220314-074818-marostegui.json
[07:48:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:04] <wikibugs>	 (PS1) Marostegui: Revert "wmnet: Failover m3-master" [dns] - https://gerrit.wikimedia.org/r/770054
[07:49:32] <wikibugs>	 (PS2) Marostegui: Revert "wmnet: Failover m3-master" [dns] - https://gerrit.wikimedia.org/r/770054
[07:50:30] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "wmnet: Failover m3-master" [dns] - https://gerrit.wikimedia.org/r/770054 (owner: Marostegui)
[07:56:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/769996 (https://phabricator.wikimedia.org/T303031) (owner: Ayounsi)
[07:58:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22380 and previous config saved to /var/cache/conftool/dbconfig/20220314-075828-marostegui.json
[07:58:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22381 and previous config saved to /var/cache/conftool/dbconfig/20220314-080323-marostegui.json
[08:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:22] <icinga-wm>	 PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:28] <icinga-wm>	 PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:54] <icinga-wm>	 PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:56] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:02] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:09:40] <icinga-wm>	 PROBLEM - Host mr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:42] <icinga-wm>	 PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[08:10:15] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[08:10:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:10:41] <icinga-wm>	 RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.00 ms
[08:10:45] <icinga-wm>	 RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms
[08:10:47] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms
[08:10:49] <icinga-wm>	 PROBLEM - Host db2075.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:49] <icinga-wm>	 PROBLEM - Host db2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:51] <icinga-wm>	 PROBLEM - Host es2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:51] <icinga-wm>	 PROBLEM - Host kubestage2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:51] <icinga-wm>	 PROBLEM - Host mc2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:51] <icinga-wm>	 PROBLEM - Host ml-serve2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:51] <icinga-wm>	 RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms
[08:10:53] <icinga-wm>	 PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:53] <icinga-wm>	 PROBLEM - Host re0.cr1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:53] <icinga-wm>	 PROBLEM - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:22] <wikibugs>	 (CR) JMeybohm: [C: +1] Set bullseye and overlayfs for kubernetes2017 [puppet] - https://gerrit.wikimedia.org/r/770439 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:11:33] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms
[08:12:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:13:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22382 and previous config saved to /var/cache/conftool/dbconfig/20220314-081333-marostegui.json
[08:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:29] <icinga-wm>	 PROBLEM - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[08:15:42] <wikibugs>	 (CR) Ayounsi: [C: +2] Add dalezhou to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/769996 (https://phabricator.wikimedia.org/T303031) (owner: Ayounsi)
[08:16:31] <icinga-wm>	 RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms
[08:16:59] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[08:17:50] <wikibugs>	 (PS1) Marostegui: site.pp: Specify db1132 status [puppet] - https://gerrit.wikimedia.org/r/770443 (https://phabricator.wikimedia.org/T303395)
[08:18:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298294)', diff saved to https://phabricator.wikimedia.org/P22383 and previous config saved to /var/cache/conftool/dbconfig/20220314-081828-marostegui.json
[08:18:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[08:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[08:18:32] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[08:18:34] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Specify db1132 status [puppet] - https://gerrit.wikimedia.org/r/770443 (https://phabricator.wikimedia.org/T303395) (owner: Marostegui)
[08:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298294)', diff saved to https://phabricator.wikimedia.org/P22384 and previous config saved to /var/cache/conftool/dbconfig/20220314-081836-marostegui.json
[08:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:55] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi) Open→Resolved @Dale_Zhou your account has been created, please reopen the task if you're having any issues.  You can find instructio...
[08:27:02] <wikibugs>	 (PS4) Muehlenhoff: Remove cumin2001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399)
[08:28:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298563)', diff saved to https://phabricator.wikimedia.org/P22385 and previous config saved to /var/cache/conftool/dbconfig/20220314-082838-marostegui.json
[08:28:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:28:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:44] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[08:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298563)', diff saved to https://phabricator.wikimedia.org/P22386 and previous config saved to /var/cache/conftool/dbconfig/20220314-082846-marostegui.json
[08:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: Btullis)
[08:31:39] <icinga-wm>	 PROBLEM - IPMI Sensor Status on db2075 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:31:54] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] Stop loading wddx PHP extension with PHP 7.4 [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: JMeybohm)
[08:32:19] <wikibugs>	 (CR) JMeybohm: [C: +2] Make k8s-ingress-wikikube page [puppet] - https://gerrit.wikimedia.org/r/767078 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[08:34:15] <wikibugs>	 (CR) DCausse: "unclear why but looking at PCC we seem to fix T303256 by making -DwikibaseSomeValueMode=skolem effective again. I'm not sure I understand " [puppet] - https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: DCausse)
[08:38:59] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye and overlayfs for kubernetes2017 [puppet] - https://gerrit.wikimedia.org/r/770439 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:39:31] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ml-serve2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:39:41] <elukey>	 ah lovely
[08:39:50] <elukey>	 this is one of the new nodes
[08:40:06] <elukey>	 anyway no user traffic, will open a task
[08:40:07] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 79 probes of 669 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:40:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298294)', diff saved to https://phabricator.wikimedia.org/P22387 and previous config saved to /var/cache/conftool/dbconfig/20220314-084036-marostegui.json
[08:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:40] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[08:44:55] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mc2019 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:46:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2017.codfw.wmnet with OS bullseye
[08:46:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:53] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 59 probes of 669 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:47:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (ayounsi) Agreed!  > i.e. our interface automation should check the adjacent ports, and not allow ge-0/0/1 to be created if xe-0/0/0 exists....
[08:47:57] <wikibugs>	 (CR) Ayounsi: "Can you share the Jinja side as well so I can review the full picture?" [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[08:48:35] <wikibugs>	 SRE, serviceops, User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (MoritzMuehlenhoff) Open→Declined This doesn't seem relevant any more, I'll boldly go ahead and close it. We originally used it for HHVM and these days we can easily insta...
[08:53:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[08:53:37] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/770446
[08:54:12] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/770446 (owner: Marostegui)
[08:54:34] <icinga-wm>	 PROBLEM - IPMI Sensor Status on es2026 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:55:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22388 and previous config saved to /var/cache/conftool/dbconfig/20220314-085541-marostegui.json
[08:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:55] <wikibugs>	 ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (ayounsi) p:Triage→High
[09:00:37] <wikibugs>	 ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (ayounsi) Surprisingly both msw1-codfw PSUs are ON: ` msw1-codfw> show chassis environment  Class Item                           Status     Measurement Power FPC 0 Power Supply 0           OK               FPC 0 Power Supply 1...
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - Host re0.cr1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-B-phase-Z on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-B-phase-Y on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-B-phase-X on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-A-phase-Z on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-A-phase-Y on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:15] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-A-phase-X on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:15] <icinga-wm>	 ACKNOWLEDGEMENT - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:01:32] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage
[09:01:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:06] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on ml-serve2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - Host ml-serve2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on mc2019.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - Host mc2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on kubestage2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - Host kubestage2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:02:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on es2026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:02:10] <icinga-wm>	 ACKNOWLEDGEMENT - Host es2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:02:10] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on db2136.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:02:11] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:02:11] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on db2075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:02:12] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696
[09:02:35] <elukey>	 XioNoX: thanks I was wondering what was happening
[09:03:14] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:03:14] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:03:33] <XioNoX>	 elukey: I think more happened, seeing all those "SSH" alerts
[09:03:41] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:04:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage
[09:04:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:08:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298563)', diff saved to https://phabricator.wikimedia.org/P22389 and previous config saved to /var/cache/conftool/dbconfig/20220314-090830-marostegui.json
[09:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:35] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[09:09:50] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:10:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:10:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22390 and previous config saved to /var/cache/conftool/dbconfig/20220314-091046-marostegui.json
[09:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:10] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:11:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:15:28] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:16:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:17:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2017.codfw.wmnet with OS bullseye
[09:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:10] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: reimage of puppet servers can fail - https://phabricator.wikimedia.org/T235067 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We can close this task and there's no Puppet server specific change needed. There have been various...
[09:18:37] <moritzm>	 !log installing vim security updates
[09:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:41] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10MoritzMuehlenhoff)
[09:23:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22391 and previous config saved to /var/cache/conftool/dbconfig/20220314-092335-marostegui.json
[09:23:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298294)', diff saved to https://phabricator.wikimedia.org/P22392 and previous config saved to /var/cache/conftool/dbconfig/20220314-092551-marostegui.json
[09:25:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[09:25:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[09:25:55] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:25:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298294)', diff saved to https://phabricator.wikimedia.org/P22393 and previous config saved to /var/cache/conftool/dbconfig/20220314-092559-marostegui.json
[09:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:34] <wikibugs>	 10SRE-swift-storage: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698 (10MatthewVernon)
[09:31:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:varnish: Load public-clouds.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond)
[09:34:02] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis)
[09:37:22] <wikibugs>	 (03PS3) 10Btullis: Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516)
[09:37:40] <wikibugs>	 (03PS1) 10MVernon: swift: add new proxies as proxyhosts, memcached_servers, conftool [puppet] - 10https://gerrit.wikimedia.org/r/770452 (https://phabricator.wikimedia.org/T303698)
[09:38:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis)
[09:38:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22394 and previous config saved to /var/cache/conftool/dbconfig/20220314-093840-marostegui.json
[09:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:29] <wikibugs>	 (03PS4) 10Btullis: Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516)
[09:41:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] swift: add new proxies as proxyhosts, memcached_servers, conftool [puppet] - 10https://gerrit.wikimedia.org/r/770452 (https://phabricator.wikimedia.org/T303698) (owner: 10MVernon)
[09:45:07] <wikibugs>	 (03PS4) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599)
[09:46:11] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34234/console" [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis)
[09:46:25] <Amir1>	 !log dbmaint on s1@eqiad (T298743)
[09:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:29] <Amir1>	 !log dbmaint on s8@eqiad (T298743)
[09:46:29] <stashbot>	 T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743
[09:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:47] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis)
[09:47:51] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: add new proxies as proxyhosts, memcached_servers, conftool [puppet] - 10https://gerrit.wikimedia.org/r/770452 (https://phabricator.wikimedia.org/T303698) (owner: 10MVernon)
[09:48:35] <Emperor>	 btullis: there's a puppet change waiting for merge "Enable production shell access for Njideka Okafor"; OK to merge?
[09:48:46] <Amir1>	 !log dbmaint on s2@eqiad (T298743)
[09:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:20] <btullis>	 Emperor: please merge that one. I got a lock error running puppet-merge at the same time as you :-)
[09:49:54] <Emperor>	 done, thanks :)
[09:50:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298294)', diff saved to https://phabricator.wikimedia.org/P22395 and previous config saved to /var/cache/conftool/dbconfig/20220314-095009-marostegui.json
[09:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:13] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:51:35] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond)
[09:53:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] C:varnish: use X-Public-Cloud to store the cloud provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond)
[09:53:26] <Emperor>	 !log rebooting ms-fe10[09-12] as part of bringing into service T303698
[09:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:31] <stashbot>	 T303698: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698
[09:53:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298563)', diff saved to https://phabricator.wikimedia.org/P22396 and previous config saved to /var/cache/conftool/dbconfig/20220314-095346-marostegui.json
[09:53:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:53:49] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[09:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298563)', diff saved to https://phabricator.wikimedia.org/P22397 and previous config saved to /var/cache/conftool/dbconfig/20220314-095353-marostegui.json
[09:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:16] <wikibugs>	 (03PS1) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456
[09:55:02] <icinga-wm>	 PROBLEM - Host ms-fe1011 is DOWN: PING CRITICAL - Packet loss = 100%
[09:55:38] <icinga-wm>	 RECOVERY - Host ms-fe1011 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[09:57:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans)
[09:57:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond)
[09:59:42] <wikibugs>	 (03CR) 10Hashar: "Looks good. We can do the same for the CI machine modules/profile/manifests/ci/httpd.pp :)" [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:59:46] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Enable profile::auto_restarts::service for apache/doc [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:03:18] <wikibugs>	 (03PS1) 10Elukey: Set overlayfs + bullseye for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/770459 (https://phabricator.wikimedia.org/T300744)
[10:03:25] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Fix the prometheus elasticsearch exporter on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis)
[10:03:30] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis)
[10:03:54] <wikibugs>	 (03PS2) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456
[10:05:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22398 and previous config saved to /var/cache/conftool/dbconfig/20220314-100515-marostegui.json
[10:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:54] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[10:06:52] <icinga-wm>	 RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:53] <wikibugs>	 (03PS2) 10Btullis: Add monitoring for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458)
[10:06:59] <wikibugs>	 (03PS1) 10Ladsgroup: Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743)
[10:07:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Set overlayfs + bullseye for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/770459 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[10:07:34] <icinga-wm>	 RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Dale_Zhou - https://phabricator.wikimedia.org/T303702 (10MGerlach)
[10:08:04] <icinga-wm>	 RECOVERY - Check systemd state on datahubsearch1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Dale_Zhou - https://phabricator.wikimedia.org/T303702 (10MGerlach)
[10:10:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for ShubhankarP - https://phabricator.wikimedia.org/T303703 (10MGerlach)
[10:10:59] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for ShubhankarP - https://phabricator.wikimedia.org/T303703 (10MGerlach)
[10:12:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) (owner: 10Ladsgroup)
[10:13:30] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34237/console" [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis)
[10:16:06] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) (owner: 10Ladsgroup)
[10:17:08] <wikibugs>	 (03Merged) 10jenkins-bot: Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) (owner: 10Ladsgroup)
[10:17:27] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add monitoring for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis)
[10:19:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399) (owner: 10Muehlenhoff)
[10:19:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set overlayfs + bullseye for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/770459 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[10:20:08] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission cumin2001 - https://phabricator.wikimedia.org/T303399 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Papaul
[10:20:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22399 and previous config saved to /var/cache/conftool/dbconfig/20220314-102020-marostegui.json
[10:20:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:21] <_joe_>	 !log running puppet on all cp hosts, to introduce the cloud netmapping
[10:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:59] <icinga-wm>	 ACKNOWLEDGEMENT - LVS datahubsearch eqiad port 9200/tcp - Search cluster serving DataHub IPv4 on datahubsearch.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.71 and port 443: Connection refused Btullis Investigating. T301458 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:31:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for apache/doc [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991)
[10:31:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[10:33:07] <wikibugs>	 (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464
[10:35:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298294)', diff saved to https://phabricator.wikimedia.org/P22400 and previous config saved to /var/cache/conftool/dbconfig/20220314-103525-marostegui.json
[10:35:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[10:35:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[10:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:30] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[10:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298294)', diff saved to https://phabricator.wikimedia.org/P22401 and previous config saved to /var/cache/conftool/dbconfig/20220314-103532-marostegui.json
[10:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298563)', diff saved to https://phabricator.wikimedia.org/P22402 and previous config saved to /var/cache/conftool/dbconfig/20220314-103602-marostegui.json
[10:36:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:06] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[10:36:17] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:37:30] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730 (owner: 10Muehlenhoff)
[10:37:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 (owner: 10Hashar)
[10:39:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for apache/doc [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:40:16] <wikibugs>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 (owner: 10Hashar)
[10:40:53] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:43:53] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for apache/CI [puppet] - 10https://gerrit.wikimedia.org/r/770467 (https://phabricator.wikimedia.org/T135991)
[10:51:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22403 and previous config saved to /var/cache/conftool/dbconfig/20220314-105107-marostegui.json
[10:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298294)', diff saved to https://phabricator.wikimedia.org/P22404 and previous config saved to /var/cache/conftool/dbconfig/20220314-105749-marostegui.json
[10:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:54] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[10:59:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:59:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:59:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:47] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:00:47] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:01:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:03:10] <logmsgbot>	 !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@594f1d5] (codfw): Revert "Revert "Mirror 100% of request to tegola in eqiad""
[11:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:41] <logmsgbot>	 !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@594f1d5] (codfw): Revert "Revert "Mirror 100% of request to tegola in eqiad"" (duration: 01m 30s)
[11:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:50] <logmsgbot>	 !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@594f1d5] (eqiad): Revert "Revert "Mirror 100% of request to tegola in eqiad""
[11:05:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22405 and previous config saved to /var/cache/conftool/dbconfig/20220314-110612-marostegui.json
[11:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471)
[11:11:50] <wikibugs>	 (03PS1) 10Btullis: Update the monitoring check for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/770471 (https://phabricator.wikimedia.org/T301458)
[11:12:05] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[11:12:52] <logmsgbot>	 !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@594f1d5] (eqiad): Revert "Revert "Mirror 100% of request to tegola in eqiad"" (duration: 07m 01s)
[11:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22406 and previous config saved to /var/cache/conftool/dbconfig/20220314-111255-marostegui.json
[11:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:10] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34241/console" [puppet] - 10https://gerrit.wikimedia.org/r/770471 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis)
[11:15:33] <logmsgbot>	 !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@c8a9efd] (eqiad): Enable mirroring on eqiad with 50% of the traffic
[11:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:12] <logmsgbot>	 !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@c8a9efd] (eqiad): Enable mirroring on eqiad with 50% of the traffic (duration: 02m 38s)
[11:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:23] <wikibugs>	 (03CR) 10Elukey: "LGTM! I just left a nit that is probably me not understanding the code, feel free to review it and in case consider my review a +1." [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans)
[11:18:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[11:18:59] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "This change adds additional firewall rules to Trusted GitLab Runners. By default they reject all outgoing docker tcp traffic to 10.0.0.0/8" [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[11:21:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298563)', diff saved to https://phabricator.wikimedia.org/P22407 and previous config saved to /var/cache/conftool/dbconfig/20220314-112117-marostegui.json
[11:21:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:21:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:23] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[11:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:26] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Update the monitoring check for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/770471 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis)
[11:28:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22408 and previous config saved to /var/cache/conftool/dbconfig/20220314-112759-marostegui.json
[11:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:25] <wikibugs>	 (03PS2) 10Muehlenhoff: Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730
[11:40:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730 (owner: 10Muehlenhoff)
[11:43:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298294)', diff saved to https://phabricator.wikimedia.org/P22409 and previous config saved to /var/cache/conftool/dbconfig/20220314-114305-marostegui.json
[11:43:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[11:43:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[11:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:10] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[11:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22410 and previous config saved to /var/cache/conftool/dbconfig/20220314-114312-marostegui.json
[11:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Cumin alias for cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/770473
[11:50:53] <wikibugs>	 (03PS1) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474
[11:51:46] <wikibugs>	 (03PS1) 10Btullis: Add single quotes around the regex to use [puppet] - 10https://gerrit.wikimedia.org/r/770475 (https://phabricator.wikimedia.org/T301458)
[11:53:27] <moritzm>	 !log restarting apache2 on matomo1002 to pick up security updates
[11:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:49] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34243/console" [puppet] - 10https://gerrit.wikimedia.org/r/770475 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis)
[11:54:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:varnish: load abuse_networks.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[11:54:11] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: C:varnish: load abuse_networks.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471)
[11:55:41] <moritzm>	 !log restarting nginx on archiva1002 to pick up security updates
[11:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:49] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add single quotes around the regex to use [puppet] - 10https://gerrit.wikimedia.org/r/770475 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis)
[11:58:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis)
[11:58:43] <wikibugs>	 (03PS1) 10Vgutierrez: Fix fe_ratelimit injection stub [labs/private] - 10https://gerrit.wikimedia.org/r/770476 (https://phabricator.wikimedia.org/T303534)
[12:00:15] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Fix fe_ratelimit injection stub [labs/private] - 10https://gerrit.wikimedia.org/r/770476 (https://phabricator.wikimedia.org/T303534) (owner: 10Vgutierrez)
[12:03:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22411 and previous config saved to /var/cache/conftool/dbconfig/20220314-120347-marostegui.json
[12:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:52] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[12:11:58] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis) LDAP membership of the `wmf` groups has been added in T303512   I have created the kerberos principal.  ` btullis@krb1001:~$ sudo ma...
[12:13:24] <wikibugs>	 (03PS2) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474
[12:15:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:17:13] <wikibugs>	 (03PS1) 10Joal: Update hadoop net-toplogy.sh script [puppet] - 10https://gerrit.wikimedia.org/r/770487
[12:18:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22412 and previous config saved to /var/cache/conftool/dbconfig/20220314-121852-marostegui.json
[12:18:53] <ottomata>	 joal lgtm ^ should I merge?
[12:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:01] <wikibugs>	 (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[12:19:02] <joal>	 ottomata: please :)
[12:19:10] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Update hadoop net-toplogy.sh script [puppet] - 10https://gerrit.wikimedia.org/r/770487 (owner: 10Joal)
[12:19:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:19:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298563)', diff saved to https://phabricator.wikimedia.org/P22413 and previous config saved to /var/cache/conftool/dbconfig/20220314-121937-marostegui.json
[12:19:39] <joal>	 thanks a lot ottomata :)
[12:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:41] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[12:26:00] <icinga-wm>	 PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:26:04] <wikibugs>	 (03PS3) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474
[12:33:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22414 and previous config saved to /var/cache/conftool/dbconfig/20220314-123357-marostegui.json
[12:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:32] <wikibugs>	 (03PS1) 10Gergő Tisza: Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923)
[12:36:34] <wikibugs>	 (03PS1) 10Gergő Tisza: Delete huwiki 500k milestone logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923)
[12:37:43] <wikibugs>	 (03PS4) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474
[12:42:46] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:46:00] <wikibugs>	 (03PS5) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474
[12:46:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add Cumin alias for cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/770473 (owner: 10Muehlenhoff)
[12:49:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22415 and previous config saved to /var/cache/conftool/dbconfig/20220314-124902-marostegui.json
[12:49:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[12:49:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[12:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:08] <wikibugs>	 (03CR) 10Vgutierrez: "vgutierrez@carrot:~/wikimedia.org/operations/puppet/modules/varnish/files/tests$ cat /tmp/vtcresults.temdsksHz8" [puppet] - 10https://gerrit.wikimedia.org/r/770474 (owner: 10Vgutierrez)
[12:49:08] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[12:49:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298294)', diff saved to https://phabricator.wikimedia.org/P22416 and previous config saved to /var/cache/conftool/dbconfig/20220314-124911-marostegui.json
[12:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:22] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) Hi CDanis!  Would it be possible to also update status.wikimedia.org to redi...
[12:58:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298563)', diff saved to https://phabricator.wikimedia.org/P22417 and previous config saved to /var/cache/conftool/dbconfig/20220314-125839-marostegui.json
[12:58:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:44] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[13:00:04] <wikibugs>	 (03PS3) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T1300).
[13:00:05] <jouncebot>	 zabe and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <tgr>	 o/
[13:00:15] <zabe>	 o/
[13:00:50] <wikibugs>	 (03CR) 10Volans: "fixed bug reported in comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans)
[13:01:08] <urbanecm>	 Hey! I can deploy today (unless tgr wishes to!)
[13:01:27] <tgr>	 thx urbanecm 
[13:02:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/770473 (owner: 10Muehlenhoff)
[13:04:22] <urbanecm>	 tgr: why is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/770496 changing static/images/project-logos/huwiki-1.5x.png please?
[13:07:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "I confirm `labweb1001/etc/mediawiki/WikitechPrivateSettings.php` has the new variable names, looks good otherwise, should work" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:07:56] <wikibugs>	 (03Merged) 10jenkins-bot: wikitech: migrate wmf* to wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:09:05] <urbanecm>	 zabe: I'm going to sync it because changes to wikitech.php can't be tested via mwdebug1001. Can you monitor the error logs for a while please?
[13:09:14] <zabe>	 yes
[13:10:34] <tgr>	 hm, not sure. It was generated by tox. The difference seems pretty significant.
[13:10:47] <tgr>	 I'll just revert it.
[13:10:51] <urbanecm>	 thank you
[13:10:58] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/wikitech.php: 95f376a: wikitech: migrate wmf* to wmg* (T45956) (duration: 00m 48s)
[13:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:02] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[13:11:09] <urbanecm>	 it might be that the commons file slightly changed, or a different set of optimizers was used, not sure
[13:11:35] <wikibugs>	 (03PS2) 10Gergő Tisza: Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923)
[13:11:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298294)', diff saved to https://phabricator.wikimedia.org/P22418 and previous config saved to /var/cache/conftool/dbconfig/20220314-131220-marostegui.json
[13:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:24] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[13:12:37] <wikibugs>	 (03PS2) 10Gergő Tisza: Delete huwiki 500k milestone logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923)
[13:12:55] <urbanecm>	 tgr_: are you still here? my client shows the tgr nick quit
[13:13:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:13:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22419 and previous config saved to /var/cache/conftool/dbconfig/20220314-131344-marostegui.json
[13:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:52] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza)
[13:14:05] <tgr_>	 urbanecm: sorry, my bouncer is being difficult.
[13:14:55] <urbanecm>	 tgr_: no problem -- I can sync it w/o tests on your end if you want, as it's a pretty trivial change.
[13:14:58] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza)
[13:15:52] <wikibugs>	 (03Merged) 10jenkins-bot: Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza)
[13:16:45] <urbanecm>	 works fine for me, syncing
[13:17:07] <wikibugs>	 (03PS3) 10Urbanecm: Delete huwiki 500k milestone logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza)
[13:17:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Delete huwiki 500k milestone logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza)
[13:17:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:toolforge::static: publish SSH fingerprints under /admin [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah)
[13:17:54] <wikibugs>	 (03Merged) 10jenkins-bot: Delete huwiki 500k milestone logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza)
[13:18:22] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 3c2c8b0cca4e48f572abd3812594097a33e64379: Stop using huwiki 500k milestone logo (T301923) (duration: 00m 48s)
[13:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:25] <stashbot>	 T301923: Enable milestone logo for hu.wikipedia - 500K articles - https://phabricator.wikimedia.org/T301923
[13:20:28] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 3fa9683: Delete huwiki 500k milestone logo files (T301923) (duration: 00m 49s)
[13:20:31] <urbanecm>	 tgr_: should be all live!
[13:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:46] <tgr_>	 Thanks! I suppose pages have to age out of varnish for the change to take effect?
[13:20:53] <tgr_>	 I see it on some pages but not all.
[13:21:55] <urbanecm>	 it should be a much shorter cache, it's just that a different URI is in the CSS for background-image
[13:22:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:46] <wikibugs>	 (03CR) 10MMandere: [C: 03+1] "marc@stark:~/Projects/puppet/modules/varnish/files/tests$ cat /tmp/vtcresults.sLoTXC4t8E" [puppet] - 10https://gerrit.wikimedia.org/r/770474 (owner: 10Vgutierrez)
[13:23:33] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/770102 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:23:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:23:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:05] <dcausse>	 !log restarting blazegraph on wdqs1006 (jvm stuck for 10hours)
[13:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22420 and previous config saved to /var/cache/conftool/dbconfig/20220314-132726-marostegui.json
[13:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22421 and previous config saved to /var/cache/conftool/dbconfig/20220314-132849-marostegui.json
[13:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:02] <tgr_>	 urbanecm: we are well beyond the 5min ResourceLoader expiry now and the old logo is still present. In any case, just a curiosity, it's not a problem if it lingers for a while (delays when enabling the milestone logo would be worse, but I don't remember seeing that). Thanks again!
[13:31:55] <urbanecm>	 that's weird. i don't see it at all, but i also don't visit huwiki frequently, so that might well be it
[13:32:02] <tgr_>	 (although now that I said that, I don't see it anymore. Maybe it was 10m?)
[13:34:16] <wikibugs>	 (03PS1) 10Gergő Tisza: Add a note about tox requirements for changing logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770502
[13:34:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:09] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] Add a note about tox requirements for changing logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770502 (owner: 10Gergő Tisza)
[13:36:20] <Emperor>	 !log restarting swift-proxy on ms-fe100[5-8] to update config to know about new eqiad frontends T303698
[13:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:23] <stashbot>	 T303698: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698
[13:39:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:39:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22422 and previous config saved to /var/cache/conftool/dbconfig/20220314-134231-marostegui.json
[13:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:03] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye
[13:43:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[13:43:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:43:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS b...
[13:43:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298563)', diff saved to https://phabricator.wikimedia.org/P22423 and previous config saved to /var/cache/conftool/dbconfig/20220314-134356-marostegui.json
[13:43:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:43:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:00] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[13:44:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:24] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye
[13:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:41] <wikibugs>	 (03PS5) 10Herron: envoy: manage strip_matching_host_port setting and enable on thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119)
[13:45:28] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1009.eqiad.wmnet
[13:45:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:40] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1009.eqiad.wmnet
[13:45:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:49] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1009.eqiad.wmnet
[13:45:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:57] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1009.eqiad.wmnet
[13:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:48:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:55] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1012.eqiad.wmnet
[13:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:12] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1012.eqiad.wmnet
[13:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:18] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1012.eqiad.wmnet
[13:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725 (10CDanis)
[13:50:23] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1012.eqiad.wmnet
[13:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725 (10CDanis) p:05Triage→03Low
[13:51:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ayounsi) I came across 3 planned parse servers in rack C8, https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=24&role=server As a reminder, C8 and D5 are dedica...
[13:52:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:52:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:52] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1016.eqiad.wmnet with reason: host reimage
[13:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:54:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:27] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1017.eqiad.wmnet with reason: host reimage
[13:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:06] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:26] <wikibugs>	 (PS15) Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481)
[13:56:48] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1011.eqiad.wmnet
[13:56:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:55] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1011.eqiad.wmnet
[13:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:02] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1011.eqiad.wmnet
[13:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:08] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1011.eqiad.wmnet
[13:57:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:13] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis) @lmata yeah, sorry, that's been on my backlog but I had been putting it off...
[13:57:19] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1016.eqiad.wmnet with reason: host reimage
[13:57:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298294)', diff saved to https://phabricator.wikimedia.org/P22424 and previous config saved to /var/cache/conftool/dbconfig/20220314-135736-marostegui.json
[13:57:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[13:57:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[13:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:40] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[13:57:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298294)', diff saved to https://phabricator.wikimedia.org/P22425 and previous config saved to /var/cache/conftool/dbconfig/20220314-135744-marostegui.json
[13:57:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:31] <herron>	 !log grafana1002:~# systemctl restart grafana-ldap-users-sync.service T303064
[13:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:34] <stashbot>	 T303064: grafana-ldap-users-sync fails to finish intermittently - https://phabricator.wikimedia.org/T303064
[13:59:00] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34256/console" [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[13:59:57] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1017.eqiad.wmnet with reason: host reimage
[13:59:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:42] <wikibugs>	 (PS1) JMeybohm: Remove LVS for miscweb [puppet] - https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966)
[14:01:27] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1010.eqiad.wmnet
[14:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:33] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1010.eqiad.wmnet
[14:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:39] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1010.eqiad.wmnet
[14:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:47] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1010.eqiad.wmnet
[14:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:07] <wikibugs>	 (CR) Jelto: [V: +1] gitlab_runner: restrict docker traffic with additional ferm rules (2 comments) [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[14:04:14] <wikibugs>	 (CR) Volans: elastic: relax & restore perms during upgrade (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: Ryan Kemper)
[14:05:48] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:54] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[14:06:38] <wikibugs>	 (CR) Elukey: [C: +1] Adopt the new alerting API on all cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/770456 (owner: Volans)
[14:07:09] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish::tests: Basic X-Public-Cloud test [puppet] - https://gerrit.wikimedia.org/r/770474 (owner: Vgutierrez)
[14:08:50] <wikibugs>	 (PS1) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324)
[14:09:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata)
[14:10:08] <wikibugs>	 (PS2) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324)
[14:10:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata)
[14:11:40] <wikibugs>	 (PS3) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324)
[14:17:55] <wikibugs>	 (CR) Joal: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata)
[14:20:15] <wikibugs>	 (PS1) JMeybohm: Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966)
[14:22:05] <wikibugs>	 (PS2) JMeybohm: Remove LVS for miscweb [puppet] - https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966)
[14:22:39] <wikibugs>	 (CR) Joal: [C: +1] "Adding a question to a question :)" [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx)
[14:24:49] <wikibugs>	 (PS4) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324)
[14:25:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298294)', diff saved to https://phabricator.wikimedia.org/P22426 and previous config saved to /var/cache/conftool/dbconfig/20220314-142502-marostegui.json
[14:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:07] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[14:25:29] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34260/console" [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata)
[14:27:15] <wikibugs>	 (PS5) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324)
[14:27:53] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34261/console" [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata)
[14:28:10] <icinga-wm>	 RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:31:44] <wikibugs>	 (CR) Herron: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34255/" [puppet] - https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: Herron)
[14:32:03] <wikibugs>	 (CR) Volans: "LGTM (CI apart) not sure if worth waiting the changes into spicerack at this point." [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[14:32:08] <wikibugs>	 (PS2) Giuseppe Lavagetto: C:varnish: introduce the X-Abuse-Network request "header" [puppet] - https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471)
[14:34:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] C:varnish: introduce the X-Abuse-Network request "header" [puppet] - https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[14:34:59] <wikibugs>	 (CR) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata)
[14:35:46] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (cmooney) @MatthewVernon Just to follow up having checked all network interfaces, forwarding tables and the end devices all looks to be working fine with ms-f...
[14:38:15] <wikibugs>	 (PS1) DCausse: [wdqs] adapt updateQueryServiceLag... [puppet] - https://gerrit.wikimedia.org/r/770508 (https://phabricator.wikimedia.org/T302494)
[14:39:55] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) A possibile way forward is to modify https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113 to avoid the profile::base::certificates profile, and modify the c...
[14:40:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22427 and previous config saved to /var/cache/conftool/dbconfig/20220314-144007-marostegui.json
[14:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (akosiaris) Replying instead of Daniel, he is currently unavailable.  @Cmjohnson, I guess rows E & F are ok, I think it will be the first stuff we will be operating...
[14:43:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:45:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:46:02] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (MatthewVernon) Open→Resolved Great, thanks. I think we can close this now :)
[14:47:27] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/770003 (owner: Jbond)
[14:47:34] <wikibugs>	 (CR) Ottomata: Standardize the stats system user uid (3 comments) [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[14:47:42] <wikibugs>	 (PS3) Ottomata: Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384)
[14:47:48] <wikibugs>	 SRE-swift-storage: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698 (MatthewVernon) Open→Resolved All online OK.
[14:50:19] <wikibugs>	 SRE-swift-storage: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 (MatthewVernon)
[14:50:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[14:51:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[14:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298563)', diff saved to https://phabricator.wikimedia.org/P22428 and previous config saved to /var/cache/conftool/dbconfig/20220314-145109-marostegui.json
[14:51:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:13] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[14:52:45] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) >>! In T300130#7774198, @elukey wrote: > A possibile way forward is to modify https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113 to avoid the profile::...
[14:53:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[14:53:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[14:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300775)', diff saved to https://phabricator.wikimedia.org/P22429 and previous config saved to /var/cache/conftool/dbconfig/20220314-145345-marostegui.json
[14:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:49] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[14:55:09] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1016.eqiad.wmnet with OS bullseye
[14:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22430 and previous config saved to /var/cache/conftool/dbconfig/20220314-145512-marostegui.json
[14:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS bulls...
[14:57:12] <wikibugs>	 (PS1) David Caro: [buildservice] Add a cookbook to update the needed images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090)
[14:57:35] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1017.eqiad.wmnet with OS bullseye
[14:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:14] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) >>! In T300130#7774257, @colewhite wrote: >>>! In T300130#7774198, @elukey wrote: >> A possibile way forward is to modify https://gerrit.wikimedia.org/r/c/operatio...
[15:01:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] [buildservice] Add a cookbook to update the needed images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: David Caro)
[15:10:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298294)', diff saved to https://phabricator.wikimedia.org/P22431 and previous config saved to /var/cache/conftool/dbconfig/20220314-151017-marostegui.json
[15:10:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[15:10:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[15:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:22] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[15:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298294)', diff saved to https://phabricator.wikimedia.org/P22432 and previous config saved to /var/cache/conftool/dbconfig/20220314-151025-marostegui.json
[15:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:38] <wikibugs>	 (PS1) Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197)
[15:15:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[15:15:14] <wikibugs>	 SRE, ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (Papaul)  TICKET NO. 2213827 U  open with CY1
[15:15:42] <wikibugs>	 (CR) Muehlenhoff: Standardize the stats system user uid (1 comment) [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:19:39] <wikibugs>	 (PS1) 4nn1l2: liwiktionary: Change timezone to CET/CEST [mediawiki-config] - https://gerrit.wikimedia.org/r/770523 (https://phabricator.wikimedia.org/T303734)
[15:24:32] <wikibugs>	 (PS4) Ottomata: Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384)
[15:24:39] <icinga-wm>	 PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:39] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:41] <icinga-wm>	 PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:56] <volans>	 WUT? XioNoX ^^^
[15:25:15] <icinga-wm>	 PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:22] <XioNoX>	 papaul: PDU? ^
[15:25:43] <XioNoX>	 volans: I'd guess it's just the mgmt interface otherwise we would get much more noise
[15:25:52] <papaul>	 XioNoX: yes i am replacing the PDU
[15:25:54] <volans>	 unless they are properly set in icinga
[15:26:13] <volans>	 why different rows are in the same pdu? which pdu did fail?
[15:26:49] <icinga-wm>	 PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:26:53] <papaul>	 ps1-a1
[15:27:39] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:28:16] <volans>	 ah mr1, got it
[15:28:57] <icinga-wm>	 PROBLEM - Host mr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:04] <jouncebot>	 jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T1530).
[15:30:11] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:30:15] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[15:30:37] <wikibugs>	 (PS5) Ottomata: Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384)
[15:31:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:31:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:32:38] <wikibugs>	 (PS6) Ottomata: Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384)
[15:33:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:34:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298563)', diff saved to https://phabricator.wikimedia.org/P22434 and previous config saved to /var/cache/conftool/dbconfig/20220314-153428-marostegui.json
[15:34:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:33] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[15:34:44] <wikibugs>	 (PS2) Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197)
[15:34:56] <wikibugs>	 (PS7) Ottomata: Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384)
[15:35:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add etcd setup for ML staging cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: Klausman)
[15:35:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] Standardize the stats system user uid [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:35:46] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34266/console" [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:36:28] <wikibugs>	 (PS2) Zabe: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956)
[15:36:38] <wikibugs>	 (CR) Muehlenhoff: Standardize the stats system user uid (1 comment) [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:38:27] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:38:29] <icinga-wm>	 RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:39:28] <wikibugs>	 (CR) Ottomata: [V: +1] Standardize the stats system user uid (2 comments) [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata)
[15:39:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298294)', diff saved to https://phabricator.wikimedia.org/P22435 and previous config saved to /var/cache/conftool/dbconfig/20220314-153945-marostegui.json
[15:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:49] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[15:39:55] <papaul>	 XioNoX: the new PDU is in place, it should clear all the alarms now
[15:40:10] <XioNoX>	 nice!
[15:40:11] <wikibugs>	 (PS3) Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197)
[15:40:47] <icinga-wm>	 RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms
[15:40:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add etcd setup for ML staging cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: Klausman)
[15:42:28] <wikibugs>	 (PS4) Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197)
[15:44:47] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 202 probes of 671 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:46:00] <icinga-wm>	 RECOVERY - Host kubestage2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.40 ms
[15:46:01] <icinga-wm>	 RECOVERY - Host db2136.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.22 ms
[15:46:01] <icinga-wm>	 RECOVERY - Host mc2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.00 ms
[15:46:02] <icinga-wm>	 RECOVERY - Host ml-serve2005.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 42.85 ms
[15:46:02] <icinga-wm>	 RECOVERY - Host re0.cr1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.52 ms
[15:46:02] <icinga-wm>	 RECOVERY - Host scs-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms
[15:46:04] <icinga-wm>	 RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms
[15:46:04] <icinga-wm>	 RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 40.79 ms
[15:46:04] <icinga-wm>	 RECOVERY - Host es2026.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 45.37 ms
[15:46:08] <icinga-wm>	 RECOVERY - Host db2075.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 46.78 ms
[15:46:40] <icinga-wm>	 RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[15:47:00] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[15:47:02] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[15:47:22] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:47:33] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34267/console" [puppet] - https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[15:48:16] <wikibugs>	 (PS1) Klausman: Add DNS SRV records for ML staging etcd in codfw [dns] - https://gerrit.wikimedia.org/r/770529
[15:48:24] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:48:30] <icinga-wm>	 RECOVERY - Juniper alarms on asw-a-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:49:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22436 and previous config saved to /var/cache/conftool/dbconfig/20220314-154933-marostegui.json
[15:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:02] <icinga-wm>	 RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms
[15:51:36] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 62 probes of 671 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:53:28] <wikibugs>	 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) 05Open→03Resolved Per the last recommendation from @Joe at T301428#7730915, we've decided to pursue MySQL/Maria as the primary backe...
[15:53:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on db2075 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:53:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on es2026 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:53:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ml-serve2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:53:44] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mc2019 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:54:10] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[15:54:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P22437 and previous config saved to /var/cache/conftool/dbconfig/20220314-155450-marostegui.json
[15:54:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata)
[15:55:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[15:56:54] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "looks good to me, minor suggestion in a comment" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[16:03:25] <wikibugs>	 (03PS1) 10Ebernhardson: Cut saneitizer re-indexing rate in half [extensions/CirrusSearch] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770056 (https://phabricator.wikimedia.org/T302733)
[16:04:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22438 and previous config saved to /var/cache/conftool/dbconfig/20220314-160438-marostegui.json
[16:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 (owner: 10Klausman)
[16:09:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P22439 and previous config saved to /var/cache/conftool/dbconfig/20220314-160955-marostegui.json
[16:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:56] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle)
[16:18:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 (owner: 10Klausman)
[16:19:44] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 (owner: 10Klausman)
[16:19:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298563)', diff saved to https://phabricator.wikimedia.org/P22440 and previous config saved to /var/cache/conftool/dbconfig/20220314-161943-marostegui.json
[16:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[16:19:48] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[16:19:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[16:19:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: Maintenance
[16:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance
[16:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, couple of questions/comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[16:21:40] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:38] <icinga-wm>	 RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.26 ms
[16:24:54] <wikibugs>	 (03CR) 10Ahmon Dancy: "Just a typo nit. Otherwise I think I'll be able to work with this." [puppet] - 10https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto)
[16:25:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298294)', diff saved to https://phabricator.wikimedia.org/P22441 and previous config saved to /var/cache/conftool/dbconfig/20220314-162501-marostegui.json
[16:25:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[16:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[16:25:05] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[16:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298294)', diff saved to https://phabricator.wikimedia.org/P22442 and previous config saved to /var/cache/conftool/dbconfig/20220314-162509-marostegui.json
[16:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:56] <wikibugs>	 (03PS1) 10David Caro: Refactor dologmsg [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090)
[16:26:03] <wikibugs>	 (03PS1) 10David Caro: buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090)
[16:28:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor dologmsg [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro)
[16:28:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans)
[16:28:46] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: utils: add script to sync abuse networks with conftool ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471)
[16:28:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: conftool-data: add phabricator_abusers to ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/770551
[16:28:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro)
[16:29:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool-data: add phabricator_abusers to ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/770551 (owner: 10Giuseppe Lavagetto)
[16:39:39] <wikibugs>	 10SRE, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[16:39:44] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:40:08] <wikibugs>	 (03CR) 10Volans: "one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[16:46:17] <wikibugs>	 (03PS1) 10Klausman: Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554
[16:47:17] <wikibugs>	 (03PS1) 10Andrew Bogott: netboot: switch cloudvirt102[1-9] partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/770555 (https://phabricator.wikimedia.org/T281276)
[16:48:20] <wikibugs>	 10SRE, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris)
[16:48:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 (owner: 10Klausman)
[16:49:11] <wikibugs>	 10SRE, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris) 05Open→03Resolved a:03akosiaris Resolving. Wikifeeds has been migrated, restrouter migration was cancelled, the process is d...
[16:49:17] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 (owner: 10Klausman)
[16:49:23] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 (owner: 10Klausman)
[16:49:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298294)', diff saved to https://phabricator.wikimedia.org/P22444 and previous config saved to /var/cache/conftool/dbconfig/20220314-164927-marostegui.json
[16:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:32] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[16:50:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] netboot: switch cloudvirt102[1-9] partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/770555 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott)
[16:52:00] <wikibugs>	 (03CR) 10Bking: [C: 03+2] [wdqs] switch wdqs1010 to the streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse)
[16:52:47] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org
[17:00:05] <jouncebot>	 ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T1700).
[17:00:38] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902
[17:01:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto)
[17:02:47] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-a1-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org
[17:04:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P22445 and previous config saved to /var/cache/conftool/dbconfig/20220314-170432-marostegui.json
[17:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:20] <wikibugs>	 (03PS1) 10JMeybohm: Prevent allocation of nodePorts when ingress is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966)
[17:15:21] <wikibugs>	 10SRE, 10ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (10Papaul) 05Open→03Resolved Replaced the PDU with a spare one we had on site.
[17:18:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:01] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902
[17:19:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P22446 and previous config saved to /var/cache/conftool/dbconfig/20220314-171937-marostegui.json
[17:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto)
[17:23:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:43] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission cumin2001 - https://phabricator.wikimedia.org/T303399 (10Papaul)
[17:24:04] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission cumin2001 - https://phabricator.wikimedia.org/T303399 (10Papaul) 05Open→03Resolved complete
[17:30:15] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:31:32] <wikibugs>	 (03PS1) 10WMDE-Fisch: Fix copy-paste mistake in template search widget [extensions/TemplateWizard] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770057 (https://phabricator.wikimedia.org/T303524)
[17:32:41] <wikibugs>	 (03PS5) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197)
[17:34:17] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902
[17:34:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298294)', diff saved to https://phabricator.wikimedia.org/P22448 and previous config saved to /var/cache/conftool/dbconfig/20220314-173442-marostegui.json
[17:34:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[17:34:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[17:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:47] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[17:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:43] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "this breaks varnish/text/31-blocked-nets.vtc" [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto)
[17:40:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10JayCano) I just wanted to confirm that I approve of this request and I'm available for any questions. Thank you!
[17:44:38] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@63af538] (eqiad): Enable 100% traffic mirroring on eqiad
[17:44:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:42] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@63af538] (eqiad): Enable 100% traffic mirroring on eqiad (duration: 01m 04s)
[17:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:46] <Amir1>	 !log start of  foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=AUDIO --sleep 2 (T226311)
[17:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:49] <stashbot>	 T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes - https://phabricator.wikimedia.org/T226311
[17:53:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[17:53:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[17:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22449 and previous config saved to /var/cache/conftool/dbconfig/20220314-175352-marostegui.json
[17:53:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:59] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[17:54:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto)
[17:54:21] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] utils: add script to sync abuse networks with conftool ipblocks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[17:54:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto)
[17:54:45] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902
[18:00:25] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902
[18:03:55] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: Add guw to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/770565 (https://phabricator.wikimedia.org/T303727)
[18:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[18:14:19] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1021.eqiad.wmnet with OS bullseye
[18:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I just realized this approach doesn't work for abuse networks:" [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto)
[18:16:06] <wikibugs>	 (03CR) 10Zabe: [C: 03+1] Add guw to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/770565 (https://phabricator.wikimedia.org/T303727) (owner: 10Gerrit maintenance bot)
[18:17:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22450 and previous config saved to /var/cache/conftool/dbconfig/20220314-181709-marostegui.json
[18:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:14] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[18:25:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1021.eqiad.wmnet with reason: host reimage
[18:25:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:06] <wikibugs>	 10SRE-OnFire, 10DBA, 10Performance-Team (Radar), 10Sustainability (Incident Followup), 10Wikimedia-Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Krink...
[18:28:40] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1021.eqiad.wmnet with reason: host reimage
[18:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:44] <wikibugs>	 10SRE-OnFire, 10DBA, 10Performance-Team (Radar), 10Sustainability (Incident Followup), 10Wikimedia-Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Krink...
[18:32:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22451 and previous config saved to /var/cache/conftool/dbconfig/20220314-183214-marostegui.json
[18:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:28] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:47:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22452 and previous config saved to /var/cache/conftool/dbconfig/20220314-184719-marostegui.json
[18:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:14] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1021.eqiad.wmnet with OS bullseye
[18:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1022.eqiad.wmnet with OS bullseye
[18:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300775)', diff saved to https://phabricator.wikimedia.org/P22453 and previous config saved to /var/cache/conftool/dbconfig/20220314-185849-marostegui.json
[18:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:54] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[19:02:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22454 and previous config saved to /var/cache/conftool/dbconfig/20220314-190224-marostegui.json
[19:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:29] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[19:04:39] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1022.eqiad.wmnet with reason: host reimage
[19:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:20] <icinga-wm>	 PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:20] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:20] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1022.eqiad.wmnet with reason: host reimage
[19:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:27] <icinga-wm>	 RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 226.71 ms
[19:11:28] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 244.25 ms
[19:13:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22455 and previous config saved to /var/cache/conftool/dbconfig/20220314-191354-marostegui.json
[19:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1022.eqiad.wmnet with OS bullseye
[19:24:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22456 and previous config saved to /var/cache/conftool/dbconfig/20220314-192859-marostegui.json
[19:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) a:05nskaggs→03None
[19:43:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) a:03RobH Thanks Arzhel! I don't believe anything else is needed from me. Assigning back to @RobH. Feel free to ping a...
[19:44:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300775)', diff saved to https://phabricator.wikimedia.org/P22457 and previous config saved to /var/cache/conftool/dbconfig/20220314-194404-marostegui.json
[19:44:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[19:44:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[19:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:09] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[19:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:32] <wikibugs>	 (03PS2) 10Ssingh: certspotter: re-enable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593)
[19:47:00] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 04-1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34280/console" [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[19:47:13] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: re-enable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[19:54:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:57:30] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:00:04] <jouncebot>	 RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T2000).
[20:00:04] <jouncebot>	 nn1l2 and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <urbanecm>	 I can deploy today
[20:00:49] <urbanecm>	 I don't see nn1l2. ebernhardson, do you want to start with your patch?
[20:00:56] <ebernhardson>	 i actually have a meeting now, will deploy in 30 min
[20:01:23] <urbanecm>	 ebernhardson: okay, happy meeting then :). let's wait.
[20:11:32] <ebernhardson>	 urbanecm: meeting done quickly :) Shipping now 
[20:11:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) So these will need to go into WMCS dedicated 10G racks, not in rows E/F, which have access to the public1 vlan.
[20:11:54] <urbanecm>	 ebernhardson: sure thing. Ping me when done (or if you need my help).
[20:12:18] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] "backport window" [extensions/CirrusSearch] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770056 (https://phabricator.wikimedia.org/T302733) (owner: 10Ebernhardson)
[20:12:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) a:05RobH→03nskaggs
[20:14:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH)
[20:14:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) a:05nskaggs→03Jclark-ctr
[20:21:10] <wikibugs>	 10SRE, 10envoy, 10serviceops: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (10RLazarus)
[20:22:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:22:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:22:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:06] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:56] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10envoy, 10serviceops: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (10RLazarus)
[20:30:09] <wikibugs>	 (03Merged) 10jenkins-bot: Cut saneitizer re-indexing rate in half [extensions/CirrusSearch] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770056 (https://phabricator.wikimedia.org/T302733) (owner: 10Ebernhardson)
[20:30:45] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:00] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:07] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:44] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:51] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:34:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:57] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage
[20:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:37] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage
[20:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:34] <nn1l2>	 is B&C still running?
[20:45:17] <ebernhardson>	 nn1l2: i'm just pushing a patch now (gerrit always takes time to merge). I can do yours next
[20:45:29] <ebernhardson>	 or if urbanecm is around they were going to i think
[20:45:44] <urbanecm>	 ebernhardson: up to you. You can do it or I can
[20:45:44] <logmsgbot>	 !log ebernhardson@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CirrusSearch/profiles/SaneitizeProfiles.config.php: Backport: [[gerrit:770056|Cut saneitizer re-indexing rate in half (T302733)]] (duration: 00m 49s)
[20:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:48] <stashbot>	 T302733: Restore CirrusSearch saneitizer to production usage - https://phabricator.wikimedia.org/T302733
[20:46:05] <nn1l2>	 There have been too many changes in the schedule recently
[20:46:21] <ebernhardson>	 urbanecm: you're more an expert than me these days, go ahead :)
[20:46:26] <urbanecm>	 ebernhardson: will do :)
[20:46:34] <ebernhardson>	 mine's complete
[20:46:37] <urbanecm>	 ack
[20:46:54] <urbanecm>	 nn1l2: the schedule is the same as it was the previous week. It's "just" pinned to PDT timezone, not UTC
[20:47:29] <urbanecm>	 (to be more precise, P(D/S)T)
[20:47:30] <AntiComposite>	 there is no good way to handle DST unfortunately
[20:47:46] <urbanecm>	 and yeah, US and Europe start DST at different times of the year
[20:48:30] <urbanecm>	 anyway, let's start
[20:48:45] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] liwiktionary: Change timezone to CET/CEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770523 (https://phabricator.wikimedia.org/T303734) (owner: 104nn1l2)
[20:49:31] <wikibugs>	 (03Merged) 10jenkins-bot: liwiktionary: Change timezone to CET/CEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770523 (https://phabricator.wikimedia.org/T303734) (owner: 104nn1l2)
[20:49:43] <urbanecm>	 nn1l2: fyi there will be another shift for similar reasons in a week or two (when Europe gets to DST)
[20:49:59] <nn1l2>	 Thanks!
[20:50:05] <urbanecm>	 (but in the opposite direction)
[20:50:36] <urbanecm>	 nn1l2: pulled to mwdebug1001
[20:50:38] <urbanecm>	 can you check?
[20:50:43] <nn1l2>	 ok
[20:52:29] <nn1l2>	 LGTM
[20:52:43] <urbanecm>	 syncing
[20:53:51] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bca9c94c9d0bec83cb777bc474fde564c441349c: liwiktionary: Change timezone to CET/CEST (T303734) (duration: 00m 49s)
[20:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:56] <urbanecm>	 nn1l2: should be live!
[20:53:57] <stashbot>	 T303734: Change time on li.wiktionary to local time zone - https://phabricator.wikimedia.org/T303734
[20:53:57] <urbanecm>	 anything else
[20:54:04] <nn1l2>	 No, thanks!
[20:54:38] <urbanecm>	 no problem!
[20:54:49] <urbanecm>	 !log UTC late B&C completed
[20:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:06] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:55:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:55:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:56:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:04] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[20:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:04] <jouncebot>	 Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T2100).
[21:07:11] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[21:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:14] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, gitlab-runner2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[21:09:14] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, gitlab-runner2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[21:10:31] <wikibugs>	 (03PS1) 10Ssingh: certspotter: set send_mail_only_on_error to false [puppet] - 10https://gerrit.wikimedia.org/r/770600
[21:12:20] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34281/console" [puppet] - 10https://gerrit.wikimedia.org/r/770600 (owner: 10Ssingh)
[21:14:57] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: set send_mail_only_on_error to false [puppet] - 10https://gerrit.wikimedia.org/r/770600 (owner: 10Ssingh)
[21:23:31] <sbassett>	 Hey all - I'd like to deploy a quick security patch (perm check) for T160800
[21:23:43] <wikibugs>	 (03CR) 10Bking: [C: 03+2] [wdqs] adapt updateQueryServiceLag... [puppet] - 10https://gerrit.wikimedia.org/r/770508 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse)
[21:29:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:08] <sbassett>	 !log Deployed security fix for T160800
[21:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:34:41] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:35:43] <wikibugs>	 (03PS1) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/770605 (https://phabricator.wikimedia.org/T301565)
[21:36:08] <inflatador>	 !log bking@cumin pooling codfw in DNS-discovery for wdqs and wdqs-internal services
[21:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/770605 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[21:36:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:03] <logmsgbot>	 !log bking@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw
[21:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:30] <inflatador>	 !log T302494 bking@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=codfw
[21:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:35] <stashbot>	 T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol - https://phabricator.wikimedia.org/T302494
[21:39:07] <wikibugs>	 (03PS1) 10Andrew Bogott: Update nic labels for cloudvirt1023/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770606 (https://phabricator.wikimedia.org/T281276)
[21:39:28] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:39:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Update nic labels for cloudvirt1023/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770606 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott)
[21:40:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:56] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:08] <wikibugs>	 (03PS1) 10Ssingh: certspotter: more tuning: use OnUnitInactiveSec [puppet] - 10https://gerrit.wikimedia.org/r/770611
[21:55:08] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:56:04] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34282/console" [puppet] - 10https://gerrit.wikimedia.org/r/770611 (owner: 10Ssingh)
[22:03:20] <inflatador>	 !log T302494 bking@puppetmaster1001 depooling eqiad in DNS-discovery for wdqs and wdqs-internal services
[22:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:24] <stashbot>	 T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol - https://phabricator.wikimedia.org/T302494
[22:03:44] <logmsgbot>	 !log bking@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad
[22:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:58] <logmsgbot>	 !log bking@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs-internal,name=eqiad
[22:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:26] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[22:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[22:06:08] <rzl>	 jouncebot: nowandnext
[22:06:08] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T2100)
[22:06:09] <jouncebot>	 In 2 hour(s) and 53 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T0100)
[22:06:37] <rzl>	 bumping all the remaining appservers and restbase machines to envoy 1.18
[22:06:51] <rzl>	 no impact expected, the canaries were fine all weekend
[22:16:35] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage
[22:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:14] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage
[22:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:22] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614
[22:21:05] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614
[22:21:19] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 (owner: 10Ryan Kemper)
[22:22:06] <Josve05a>	 https://www.irccloud.com/pastebin/iFZnzWnM/
[22:22:15] <Josve05a>	 oops...
[22:22:17] <Josve05a>	 Hi all...Should a person with the following user agent string be able to access Wikipedia (given HTTPS HSTS)?
[22:22:17] <Josve05a>	 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) 
[22:22:17] <Josve05a>	 I'm thinking..yes? They are using a lower version of macOS than recommended on
[22:22:17] <Josve05a>	 https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations#For_users_of_Apple_macOS
[22:22:17] <Josve05a>	 but they are using the latest version of Chrome...
[22:23:09] <Josve05a>	 .... Chrome/99.0.4844.51 Safari/537.36
[22:24:06] <Josve05a>	 i.e. Chrome 99 on Mac OS X (El Capitan)
[22:25:18] <rzl>	 Josve05a: unfortunately this is a known issue affecting certain older clients, including anything running on OS X 10.11 and earlier -- https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry has details
[22:26:58] <Josve05a>	 Ah, I was pretty much up to date than, only that I thought (given info on https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations) that old Macs could still access Wikipedia if they had an updated compatible browser
[22:27:08] <Josve05a>	 but then I know, thanks!
[22:27:13] <Josve05a>	 then*
[22:27:35] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:14] <rzl>	 Josve05a: if upgrading macOS isn't an option then Firefox might work, but I think all the other major browsers depend on the OS for this
[22:28:25] <ryankemper>	 !log T301108 `ryankemper@cumin1001:~$ sudo cookbook sre.wdqs.data-transfer --source wdqs1009.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "moving away from legacy updater" --blazegraph_instance wikidata --without-lvs --task-id T301108` on tmux `wdqs`
[22:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:29] <stashbot>	 T301108: Migrate wdqs1010 to the Flink based Streaming Updater and cleanup left over pieces of the old updater - https://phabricator.wikimedia.org/T301108
[22:28:43] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:43] <Josve05a>	 rzl: Ah, thanks!
[22:32:01] <Josve05a>	 We need to update that page above and some VRT response templates it seems...
[22:32:20] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:46] <rzl>	 yeah - that firefox tip is just from glancing at letsencrypt documentation, I don't have a machine handy to test but I'll ask around to confirm
[22:33:07] <rzl>	 one way or the other I'll see about getting that page clarified
[22:34:04] <rzl>	 (from the edit history, it looks like the "If that is not possible, [...] consider installing an alternate secure browser" sentence is older than the LE issue, so it was probably a workaround for a previous Safari issue)
[22:37:12] <Josve05a>	 Yeah, I'll relay that information to the end-user and see if they can get Firefox to work or if they can somehow upgrade their OS
[22:37:22] <Josve05a>	 Thanks again
[22:37:27] <rzl>	 👍
[22:44:15] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: more tuning: use OnUnitInactiveSec [puppet] - 10https://gerrit.wikimedia.org/r/770611 (owner: 10Ssingh)
[23:08:49] <wikibugs>	 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10AntiCompositeNumber) This is necessary mitigation for T265845.  > Is issuing thousands of global blocks per day now an accepte...
[23:20:17] <wikibugs>	 (03PS1) 10Tim Starling: populateGlobalEditCount.php: skip lu_global_id=0 and add restart option [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770058
[23:20:29] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] populateGlobalEditCount.php: skip lu_global_id=0 and add restart option [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770058 (owner: 10Tim Starling)
[23:22:42] <wikibugs>	 (03Merged) 10jenkins-bot: populateGlobalEditCount.php: skip lu_global_id=0 and add restart option [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770058 (owner: 10Tim Starling)
[23:26:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:27:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:28:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[23:44:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[23:44:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22460 and previous config saved to /var/cache/conftool/dbconfig/20220314-234430-marostegui.json
[23:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:34] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775