[00:00:40] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:00] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:09:10] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:15] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:45:15] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:54:02] 10Puppet, 10SRE, 10Infrastructure-Foundations: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10lmata) p:05Triage→03Medium sgtm [02:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [02:10:26] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:19:48] PROBLEM - SSH on 
wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:21:28] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:25:20] (PS2) JHathaway: Move vendored modules to vendor_modules [puppet] - https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) [03:27:14] Puppet, Infrastructure-Foundations, Patch-For-Review: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (jhathaway) There seems to be some coalescing around moving vendored modules into their own directory, here is a patch that does just that, feedback very much appreciated,... [03:59:30] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) For this round of reimaging I'm happy to just edit the options while reimaging, but - I'll want to do this myself so I do... 
[05:39:55] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (10odimitrijevic) [05:45:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [06:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [06:05:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:05:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:12:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:38] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:18:41] (03PS1) 10Marostegui: dbproxy1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770315 [06:19:51] (03PS1) 10Marostegui: Revert "wmnet: Switchover 
m2-master" [dns] - 10https://gerrit.wikimedia.org/r/770052 [06:19:58] (03CR) 10Marostegui: [C: 03+2] dbproxy1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770315 (owner: 10Marostegui) [06:20:12] (03PS2) 10Marostegui: Revert "wmnet: Switchover m2-master" [dns] - 10https://gerrit.wikimedia.org/r/770052 [06:22:51] (03PS1) 10Marostegui: Revert "dbproxy1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/770053 [06:24:42] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/770316 [06:25:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [06:25:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [06:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 12 hosts with reason: Maintenance [06:25:20] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/770316 (owner: 10Marostegui) [06:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 12 hosts with reason: Maintenance [06:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:47] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/770053 (owner: 10Marostegui) [06:30:28] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Switchover m2-master" [dns] - 10https://gerrit.wikimedia.org/r/770052 (owner: 10Marostegui) [06:30:32] (03PS3) 10Marostegui: Revert "wmnet: Switchover m2-master" [dns] - 
10https://gerrit.wikimedia.org/r/770052 [06:35:02] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:44:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:44:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220313T0800) [07:00:05] Amir1, awight, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:22] aren't these usually an hour later? [07:01:02] * taavi can't guarantee being here this early [07:02:22] I only know about the thursday ones [07:02:25] those are an hour later [07:02:57] those are usually the same time as the normal windows [07:03:14] oh [07:03:19] maybe some DST mess? 
[07:03:21] it's daylight something-or-other [07:03:30] where one part of the world switched and not the other [07:03:33] sigh [07:03:49] yeah because my thursday one is set for 9 am also and that's early [07:03:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [07:03:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [07:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298294)', diff saved to https://phabricator.wikimedia.org/P22375 and previous config saved to /var/cache/conftool/dbconfig/20220314-070404-marostegui.json [07:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:08] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [07:07:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:07:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:07:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298563)', diff saved to https://phabricator.wikimedia.org/P22376 and previous config saved to /var/cache/conftool/dbconfig/20220314-070721-marostegui.json [07:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:25] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:11:39] !log dbmaint on s7@eqiad T300775 [07:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:45] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:12:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [07:12:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [07:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [07:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [07:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:51] !log restart varnishkafka-webrequest on cp6001 to test a metric issue [07:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:20] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:24:32] (03PS1) 10Elukey: Set bullseye and overlayfs for kubernetes2017 [puppet] - 10https://gerrit.wikimedia.org/r/770439 (https://phabricator.wikimedia.org/T300744) [07:24:34] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes1007 [puppet] - 10https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744) [07:33:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298294)', diff saved to https://phabricator.wikimedia.org/P22377 and previous config saved to /var/cache/conftool/dbconfig/20220314-073313-marostegui.json [07:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:17] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [07:33:38] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [07:36:52] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:39:51] (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:varnish: Load public-clouds.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [07:40:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond) [07:43:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298563)', diff saved to https://phabricator.wikimedia.org/P22378 and previous config saved to /var/cache/conftool/dbconfig/20220314-074323-marostegui.json [07:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:28] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:48:18] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22379 and previous config saved to /var/cache/conftool/dbconfig/20220314-074818-marostegui.json [07:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:04] (03PS1) 10Marostegui: Revert "wmnet: Failover m3-master" [dns] - 10https://gerrit.wikimedia.org/r/770054 [07:49:32] (03PS2) 10Marostegui: Revert "wmnet: Failover m3-master" [dns] - 10https://gerrit.wikimedia.org/r/770054 [07:50:30] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Failover m3-master" [dns] - 10https://gerrit.wikimedia.org/r/770054 (owner: 10Marostegui) [07:56:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/769996 (https://phabricator.wikimedia.org/T303031) (owner: 10Ayounsi) [07:58:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22380 and previous config saved to /var/cache/conftool/dbconfig/20220314-075828-marostegui.json [07:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22381 and previous config saved to /var/cache/conftool/dbconfig/20220314-080323-marostegui.json [08:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:06:28] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:07:54] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:07:56] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:08:02] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:09:40] PROBLEM - OSPF status on cr2-codfw is CRITICAL: 
OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:09:40] PROBLEM - Host mr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:09:42] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:10:15] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [08:10:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:41] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.00 ms [08:10:45] RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms [08:10:47] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [08:10:49] PROBLEM - Host db2075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:49] PROBLEM - Host db2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:51] PROBLEM - Host es2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:51] PROBLEM - Host kubestage2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:51] PROBLEM - Host mc2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:51] PROBLEM - Host ml-serve2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:51] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [08:10:53] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:10:53] PROBLEM - Host re0.cr1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:10:53] PROBLEM - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% 
[08:11:22] (03CR) 10JMeybohm: [C: 03+1] Set bullseye and overlayfs for kubernetes2017 [puppet] - 10https://gerrit.wikimedia.org/r/770439 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:11:33] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [08:12:31] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22382 and previous config saved to /var/cache/conftool/dbconfig/20220314-081333-marostegui.json [08:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:29] PROBLEM - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:15:42] (03CR) 10Ayounsi: [C: 03+2] Add dalezhou to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/769996 (https://phabricator.wikimedia.org/T303031) (owner: 10Ayounsi) [08:16:31] RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms [08:16:59] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [08:17:50] (03PS1) 10Marostegui: site.pp: Specify db1132 status [puppet] - 10https://gerrit.wikimedia.org/r/770443 (https://phabricator.wikimedia.org/T303395) [08:18:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298294)', diff saved to https://phabricator.wikimedia.org/P22383 and previous config saved to /var/cache/conftool/dbconfig/20220314-081828-marostegui.json [08:18:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:18:32] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:18:34] (CR) Marostegui: [C: +2] site.pp: Specify db1132 status [puppet] - https://gerrit.wikimedia.org/r/770443 (https://phabricator.wikimedia.org/T303395) (owner: Marostegui) [08:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298294)', diff saved to https://phabricator.wikimedia.org/P22384 and previous config saved to /var/cache/conftool/dbconfig/20220314-081836-marostegui.json [08:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:55] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (ayounsi) Open→Resolved @Dale_Zhou your account has been created, please reopen the task if you're having any issues. You can find instructio... 
[08:27:02] (03PS4) 10Muehlenhoff: Remove cumin2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399) [08:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298563)', diff saved to https://phabricator.wikimedia.org/P22385 and previous config saved to /var/cache/conftool/dbconfig/20220314-082838-marostegui.json [08:28:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:28:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:44] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [08:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298563)', diff saved to https://phabricator.wikimedia.org/P22386 and previous config saved to /var/cache/conftool/dbconfig/20220314-082846-marostegui.json [08:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis) [08:31:39] PROBLEM - IPMI Sensor Status on db2075 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:31:54] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Stop loading wddx PHP extension with PHP 7.4 [puppet] - 
10https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: 10JMeybohm) [08:32:19] (03CR) 10JMeybohm: [C: 03+2] Make k8s-ingress-wikikube page [puppet] - 10https://gerrit.wikimedia.org/r/767078 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:34:15] (03CR) 10DCausse: "unclear why but looking at PCC we seem to fix T303256 by making -DwikibaseSomeValueMode=skolem effective again. I'm not sure I understand " [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [08:38:59] (03CR) 10Elukey: [C: 03+2] Set bullseye and overlayfs for kubernetes2017 [puppet] - 10https://gerrit.wikimedia.org/r/770439 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:39:31] PROBLEM - IPMI Sensor Status on ml-serve2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:39:41] ah lovely [08:39:50] this is one of the new nodes [08:40:06] anyway no user traffic, will open a task [08:40:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 79 probes of 669 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:40:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298294)', diff saved to https://phabricator.wikimedia.org/P22387 and previous config saved to /var/cache/conftool/dbconfig/20220314-084036-marostegui.json [08:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:40] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:44:55] PROBLEM - IPMI Sensor Status on mc2019 is CRITICAL: Sensor 
Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:46:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2017.codfw.wmnet with OS bullseye [08:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 59 probes of 669 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:47:05] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (ayounsi) Agreed! > i.e. our interface automation should check the adjacent ports, and not allow ge-0/0/1 to be created if xe-0/0/0 exists.... [08:47:57] (CR) Ayounsi: "Can you share the Jinja side as well so I can review the full picture?" [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney) [08:48:35] SRE, serviceops, User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (MoritzMuehlenhoff) Open→Declined This doesn't seem relevant any more, I'll boldly go ahead and close it. We originally used it for HHVM and these days we can easily insta... 
[08:53:26] (KubernetesCalicoDown) firing: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [08:53:37] (PS1) Marostegui: wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/770446 [08:54:12] (CR) Marostegui: [C: +2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/770446 (owner: Marostegui) [08:54:34] PROBLEM - IPMI Sensor Status on es2026 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:55:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22388 and previous config saved to /var/cache/conftool/dbconfig/20220314-085541-marostegui.json [08:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:55] ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (ayounsi) p:Triage→High [09:00:37] ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (ayounsi) Surprisingly both msw1-codfw PSUs are ON: ` msw1-codfw> show chassis environment Class Item Status Measurement Power FPC 0 Power Supply 0 OK FPC 0 Power Supply 1... 
[09:01:14] ACKNOWLEDGEMENT - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:01:14] ACKNOWLEDGEMENT - Host re0.cr1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:01:14] ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-B-phase-Z on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:14] ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-B-phase-Y on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:14] ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-B-phase-X on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:14] ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-A-phase-Z on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:14] ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-A-phase-Y on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:15] ACKNOWLEDGEMENT - ps1-a1-codfw-infeed-load-tower-A-phase-X on ps1-a1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T303696 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:15] ACKNOWLEDGEMENT - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:01:32] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage [09:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:06] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:02:09] ACKNOWLEDGEMENT - SSH on ml-serve2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:09] ACKNOWLEDGEMENT - Host ml-serve2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:02:09] ACKNOWLEDGEMENT - SSH on mc2019.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:09] ACKNOWLEDGEMENT - Host mc2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:02:09] ACKNOWLEDGEMENT - SSH on kubestage2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:09] ACKNOWLEDGEMENT - Host kubestage2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:02:09] ACKNOWLEDGEMENT - SSH on es2026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:10] ACKNOWLEDGEMENT - Host es2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:02:10] ACKNOWLEDGEMENT - SSH on db2136.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:11] ACKNOWLEDGEMENT - Host db2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:02:11] ACKNOWLEDGEMENT - SSH on db2075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:12] ACKNOWLEDGEMENT - Host db2075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303696 [09:02:35] XioNoX: thanks I was wondering what was happening [09:03:14] ACKNOWLEDGEMENT - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:03:14] ACKNOWLEDGEMENT - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms ayounsi https://phabricator.wikimedia.org/T303696 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:03:33] elukey: I think more happened, seeing all those "SSH" alerts [09:03:41] (KubernetesCalicoDown) resolved: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:04:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2017.codfw.wmnet with reason: host reimage [09:04:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:11] (KubernetesCalicoDown) firing: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:08:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298563)', diff saved to https://phabricator.wikimedia.org/P22389 and previous config saved to /var/cache/conftool/dbconfig/20220314-090830-marostegui.json [09:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:35] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [09:09:50] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:10:11] (KubernetesCalicoDown) resolved: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:10:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22390 and previous config saved to /var/cache/conftool/dbconfig/20220314-091046-marostegui.json [09:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:10] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:11] (KubernetesCalicoDown) firing: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:15:28] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:16:11] (KubernetesCalicoDown) resolved: kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:17:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2017.codfw.wmnet with OS bullseye [09:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:10] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: reimage of puppet servers can fail - https://phabricator.wikimedia.org/T235067 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We can close this task and there's no Puppet server specific change needed. There have been various... [09:18:37] !log installing vim security updates [09:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:41] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10MoritzMuehlenhoff) [09:23:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22391 and previous config saved to /var/cache/conftool/dbconfig/20220314-092335-marostegui.json [09:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298294)', diff saved to https://phabricator.wikimedia.org/P22392 and previous config saved to /var/cache/conftool/dbconfig/20220314-092551-marostegui.json [09:25:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [09:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [09:25:55] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - 
https://phabricator.wikimedia.org/T298294 [09:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298294)', diff saved to https://phabricator.wikimedia.org/P22393 and previous config saved to /var/cache/conftool/dbconfig/20220314-092559-marostegui.json [09:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:34] 10SRE-swift-storage: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698 (10MatthewVernon) [09:31:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:varnish: Load public-clouds.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [09:34:02] (03CR) 10Btullis: [C: 03+2] Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis) [09:37:22] (03PS3) 10Btullis: Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) [09:37:40] (03PS1) 10MVernon: swift: add new proxies as proxyhosts, memcached_servers, conftool [puppet] - 10https://gerrit.wikimedia.org/r/770452 (https://phabricator.wikimedia.org/T303698) [09:38:05] (03CR) 10jerkins-bot: [V: 04-1] Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis) [09:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22394 and previous config saved to /var/cache/conftool/dbconfig/20220314-093840-marostegui.json [09:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:29] (03PS4) 
10Btullis: Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) [09:41:45] (03CR) 10Marostegui: [C: 03+1] swift: add new proxies as proxyhosts, memcached_servers, conftool [puppet] - 10https://gerrit.wikimedia.org/r/770452 (https://phabricator.wikimedia.org/T303698) (owner: 10MVernon) [09:45:07] (03PS4) 10Btullis: Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) [09:46:11] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34234/console" [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [09:46:25] !log dbmaint on s1@eqiad (T298743) [09:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:29] !log dbmaint on s8@eqiad (T298743) [09:46:29] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [09:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:47] (03CR) 10Btullis: [C: 03+2] Enable production shell access for Njideka Okafor [puppet] - 10https://gerrit.wikimedia.org/r/769969 (https://phabricator.wikimedia.org/T303516) (owner: 10Btullis) [09:47:51] (03CR) 10MVernon: [C: 03+2] swift: add new proxies as proxyhosts, memcached_servers, conftool [puppet] - 10https://gerrit.wikimedia.org/r/770452 (https://phabricator.wikimedia.org/T303698) (owner: 10MVernon) [09:48:35] btullis: there's a puppet change waiting for merge "Enable production shell access for Njideka Okafor"; OK to merge? [09:48:46] !log dbmaint on s2@eqiad (T298743) [09:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:20] Emperor: please merge that one. 
I got a lock error running puppet-merge at the same time as you :-) [09:49:54] done, thanks :) [09:50:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298294)', diff saved to https://phabricator.wikimedia.org/P22395 and previous config saved to /var/cache/conftool/dbconfig/20220314-095009-marostegui.json [09:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:13] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [09:51:35] (03PS8) 10Giuseppe Lavagetto: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond) [09:53:15] (03CR) 10Vgutierrez: [C: 03+1] C:varnish: use X-Public-Cloud to store the cloud provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond) [09:53:26] !log rebooting ms-fe10[09-12] as part of bringing into service T303698 [09:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:31] T303698: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698 [09:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298563)', diff saved to https://phabricator.wikimedia.org/P22396 and previous config saved to /var/cache/conftool/dbconfig/20220314-095346-marostegui.json [09:53:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:53:49] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [09:53:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298563)', diff saved to https://phabricator.wikimedia.org/P22397 and previous config saved to /var/cache/conftool/dbconfig/20220314-095353-marostegui.json [09:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:16] (03PS1) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 [09:55:02] PROBLEM - Host ms-fe1011 is DOWN: PING CRITICAL - Packet loss = 100% [09:55:38] RECOVERY - Host ms-fe1011 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:57:25] (03CR) 10jerkins-bot: [V: 04-1] Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [09:57:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 (owner: 10Jbond) [09:59:42] (03CR) 10Hashar: "Looks good. 
We can do the same for the CI machine modules/profile/manifests/ci/httpd.pp :)" [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:59:46] (03CR) 10Hashar: [C: 03+1] Enable profile::auto_restarts::service for apache/doc [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:03:18] (03PS1) 10Elukey: Set overlayfs + bullseye for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/770459 (https://phabricator.wikimedia.org/T300744) [10:03:25] (03CR) 10Btullis: [V: 03+1] Fix the prometheus elasticsearch exporter on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [10:03:30] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the prometheus elasticsearch exporter on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770005 (https://phabricator.wikimedia.org/T303599) (owner: 10Btullis) [10:03:54] (03PS2) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 [10:05:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22398 and previous config saved to /var/cache/conftool/dbconfig/20220314-100515-marostegui.json [10:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [10:06:52] RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:53] (03PS2) 10Btullis: Add monitoring for the datahubsearch LVS service 
[puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) [10:06:59] (03PS1) 10Ladsgroup: Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) [10:07:26] (03CR) 10JMeybohm: [C: 03+1] Set overlayfs + bullseye for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/770459 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:07:34] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:46] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Dale_Zhou - https://phabricator.wikimedia.org/T303702 (10MGerlach) [10:08:04] RECOVERY - Check systemd state on datahubsearch1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:17] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Dale_Zhou - https://phabricator.wikimedia.org/T303702 (10MGerlach) [10:10:17] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for ShubhankarP - https://phabricator.wikimedia.org/T303703 (10MGerlach) [10:10:59] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for ShubhankarP - https://phabricator.wikimedia.org/T303703 (10MGerlach) [10:12:10] (03CR) 10Marostegui: [C: 03+1] Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) (owner: 10Ladsgroup) [10:13:30] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34237/console" [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [10:16:06] (03CR) 10Ladsgroup: [C: 03+2] Add 2022/change_transcode_T298743.py [software/schema-changes] - 
10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) (owner: 10Ladsgroup) [10:17:08] (03Merged) 10jenkins-bot: Add 2022/change_transcode_T298743.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770461 (https://phabricator.wikimedia.org/T298743) (owner: 10Ladsgroup) [10:17:27] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add monitoring for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [10:19:01] (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399) (owner: 10Muehlenhoff) [10:19:53] (03CR) 10Elukey: [C: 03+2] Set overlayfs + bullseye for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/770459 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:20:08] 10ops-codfw, 10decommission-hardware: decommission cumin2001 - https://phabricator.wikimedia.org/T303399 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Papaul [10:20:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22399 and previous config saved to /var/cache/conftool/dbconfig/20220314-102020-marostegui.json [10:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:21] <_joe_> !log running puppet on all cp hosts, to introduce the cloud netmapping [10:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:59] ACKNOWLEDGEMENT - LVS datahubsearch eqiad port 9200/tcp - Search cluster serving DataHub IPv4 on datahubsearch.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.71 and port 443: Connection refused Btullis Investigating. 
T301458 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:31:11] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for apache/doc [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) [10:31:26] (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:33:07] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 [10:35:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298294)', diff saved to https://phabricator.wikimedia.org/P22400 and previous config saved to /var/cache/conftool/dbconfig/20220314-103525-marostegui.json [10:35:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:35:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:30] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [10:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298294)', diff saved to https://phabricator.wikimedia.org/P22401 and previous config saved to /var/cache/conftool/dbconfig/20220314-103532-marostegui.json [10:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298563)', diff saved to https://phabricator.wikimedia.org/P22402 and previous config 
saved to /var/cache/conftool/dbconfig/20220314-103602-marostegui.json [10:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:06] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [10:36:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:30] (03CR) 10Klausman: [C: 03+1] Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730 (owner: 10Muehlenhoff) [10:37:34] (03CR) 10jerkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 (owner: 10Hashar) [10:39:36] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for apache/doc [puppet] - 10https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:40:16] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 (owner: 10Hashar) [10:40:53] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:43:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:49:04] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for apache/CI [puppet] - 10https://gerrit.wikimedia.org/r/770467 (https://phabricator.wikimedia.org/T135991) [10:51:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22403 and previous config saved to 
/var/cache/conftool/dbconfig/20220314-105107-marostegui.json [10:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298294)', diff saved to https://phabricator.wikimedia.org/P22404 and previous config saved to /var/cache/conftool/dbconfig/20220314-105749-marostegui.json [10:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:54] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [10:59:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:59:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:00:47] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:01:26] (KubernetesCalicoDown) resolved: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:03:10] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@594f1d5] (codfw): Revert "Revert "Mirror 100% of request to tegola in eqiad"" [11:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:41] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@594f1d5] (codfw): Revert "Revert "Mirror 100% of request to tegola in eqiad"" (duration: 01m 30s) [11:04:43] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:50] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@594f1d5] (eqiad): Revert "Revert "Mirror 100% of request to tegola in eqiad"" [11:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22405 and previous config saved to /var/cache/conftool/dbconfig/20220314-110612-marostegui.json [11:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:17] (03PS2) 10Giuseppe Lavagetto: P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) [11:11:50] (03PS1) 10Btullis: Update the monitoring check for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/770471 (https://phabricator.wikimedia.org/T301458) [11:12:05] (03CR) 10Vgutierrez: [C: 03+1] P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [11:12:52] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@594f1d5] (eqiad): Revert "Revert "Mirror 100% of request to tegola in eqiad"" (duration: 07m 01s) [11:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22406 and previous config saved to /var/cache/conftool/dbconfig/20220314-111255-marostegui.json [11:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:10] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34241/console" [puppet] - 
10https://gerrit.wikimedia.org/r/770471 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [11:15:33] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@c8a9efd] (eqiad): Enable mirroring on eqiad with 50% of the traffic [11:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:12] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@c8a9efd] (eqiad): Enable mirroring on eqiad with 50% of the traffic (duration: 02m 38s) [11:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:23] (03CR) 10Elukey: "LGTM! I just left a nit that is probably me not understanding the code, feel free to review it and in case consider my review a +1." [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [11:18:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] P:cache::base: add netmapper file for abuse networks [puppet] - 10https://gerrit.wikimedia.org/r/769899 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [11:18:59] (03CR) 10Jelto: [V: 03+1] "This change adds additional firewall rules to Trusted GitLab Runners. 
By default they reject all outgoing docker tcp traffic to 10.0.0.0/8" [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298563)', diff saved to https://phabricator.wikimedia.org/P22407 and previous config saved to /var/cache/conftool/dbconfig/20220314-112117-marostegui.json [11:21:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:21:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:23] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [11:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:26] (03CR) 10Btullis: [V: 03+1 C: 03+2] Update the monitoring check for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/770471 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [11:28:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22408 and previous config saved to /var/cache/conftool/dbconfig/20220314-112759-marostegui.json [11:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:25] (03PS2) 10Muehlenhoff: Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730 [11:40:58] (03CR) 10Muehlenhoff: [C: 03+2] Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730 (owner: 10Muehlenhoff) [11:43:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1143 (T298294)', diff saved to https://phabricator.wikimedia.org/P22409 and previous config saved to /var/cache/conftool/dbconfig/20220314-114305-marostegui.json [11:43:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:43:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:10] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22410 and previous config saved to /var/cache/conftool/dbconfig/20220314-114312-marostegui.json [11:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:24] (03PS1) 10Muehlenhoff: Add Cumin alias for cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/770473 [11:50:53] (03PS1) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474 [11:51:46] (03PS1) 10Btullis: Add single quotes around the regex to use [puppet] - 10https://gerrit.wikimedia.org/r/770475 (https://phabricator.wikimedia.org/T301458) [11:53:27] !log restarting apache2 on matomo1002 to pick up security updates [11:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34243/console" [puppet] - 10https://gerrit.wikimedia.org/r/770475 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [11:54:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:varnish: load abuse_networks.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [11:54:11] (03PS2) 10Giuseppe Lavagetto: C:varnish: load abuse_networks.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769900 (https://phabricator.wikimedia.org/T302471) [11:55:41] !log restarting nginx on archiva1002 to pick up security updates [11:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:49] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add single quotes around the regex to use [puppet] - 10https://gerrit.wikimedia.org/r/770475 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [11:58:13] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis) [11:58:43] (03PS1) 10Vgutierrez: Fix fe_ratelimit injection stub [labs/private] - 10https://gerrit.wikimedia.org/r/770476 (https://phabricator.wikimedia.org/T303534) [12:00:15] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Fix fe_ratelimit injection stub [labs/private] - 10https://gerrit.wikimedia.org/r/770476 (https://phabricator.wikimedia.org/T303534) (owner: 10Vgutierrez) [12:03:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22411 and previous config saved to /var/cache/conftool/dbconfig/20220314-120347-marostegui.json [12:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:52] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - 
https://phabricator.wikimedia.org/T298294 [12:11:58] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis) LDAP membership of the `wmf` groups has been added in T303512 I have created the kerberos principal. ` btullis@krb1001:~$ sudo ma... [12:13:24] (03PS2) 10Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - 10https://gerrit.wikimedia.org/r/770474 [12:15:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [12:17:13] (03PS1) 10Joal: Update hadoop net-toplogy.sh script [puppet] - 10https://gerrit.wikimedia.org/r/770487 [12:18:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22412 and previous config saved to /var/cache/conftool/dbconfig/20220314-121852-marostegui.json [12:18:53] joal lgtm ^ should I merge? 
[12:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:01] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [12:19:02] ottomata: please :) [12:19:10] (CR) Ottomata: [C: +2] Update hadoop net-toplogy.sh script [puppet] - https://gerrit.wikimedia.org/r/770487 (owner: Joal) [12:19:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:19:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298563)', diff saved to https://phabricator.wikimedia.org/P22413 and previous config saved to /var/cache/conftool/dbconfig/20220314-121937-marostegui.json [12:19:39] thanks a lot ottomata :) [12:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:41] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [12:26:00] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:26:04] (PS3) Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - https://gerrit.wikimedia.org/r/770474 [12:33:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22414 and previous config saved to /var/cache/conftool/dbconfig/20220314-123357-marostegui.json
[12:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:32] (PS1) Gergő Tisza: Stop using huwiki 500k milestone logo [mediawiki-config] - https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) [12:36:34] (PS1) Gergő Tisza: Delete huwiki 500k milestone logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) [12:37:43] (PS4) Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - https://gerrit.wikimedia.org/r/770474 [12:42:46] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:00] (PS5) Vgutierrez: varnish::tests: Basic X-Public-Cloud test [puppet] - https://gerrit.wikimedia.org/r/770474 [12:46:43] (CR) Arturo Borrero Gonzalez: [C: +1] Add Cumin alias for cloudgw [puppet] - https://gerrit.wikimedia.org/r/770473 (owner: Muehlenhoff) [12:49:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22415 and previous config saved to /var/cache/conftool/dbconfig/20220314-124902-marostegui.json [12:49:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:49:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:08] (CR) Vgutierrez: "vgutierrez@carrot:~/wikimedia.org/operations/puppet/modules/varnish/files/tests$ cat /tmp/vtcresults.temdsksHz8" [puppet] - https://gerrit.wikimedia.org/r/770474 (owner: Vgutierrez) [12:49:08] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:49:10]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298294)', diff saved to https://phabricator.wikimedia.org/P22416 and previous config saved to /var/cache/conftool/dbconfig/20220314-124911-marostegui.json [12:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:22] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) Hi CDanis! Would it be possible to also update status.wikimedia.org to redi... [12:58:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298563)', diff saved to https://phabricator.wikimedia.org/P22417 and previous config saved to /var/cache/conftool/dbconfig/20220314-125839-marostegui.json [12:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:44] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [13:00:04] (03PS3) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T1300). [13:00:05] zabe and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:13] o/ [13:00:15] o/ [13:00:50] (CR) Volans: "fixed bug reported in comment" [cookbooks] - https://gerrit.wikimedia.org/r/770456 (owner: Volans) [13:01:08] Hey! I can deploy today (unless tgr wishes to!) [13:01:27] thx urbanecm [13:02:48] (CR) Muehlenhoff: [C: +2] Add Cumin alias for cloudgw [puppet] - https://gerrit.wikimedia.org/r/770473 (owner: Muehlenhoff) [13:04:22] tgr: why is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/770496 changing static/images/project-logos/huwiki-1.5x.png please? [13:07:16] (CR) Urbanecm: [C: +2] "I confirm `labweb1001/etc/mediawiki/WikitechPrivateSettings.php` has the new variable names, looks good otherwise, should work" [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:07:56] (Merged) jenkins-bot: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:09:05] zabe: I'm going to sync it because changes to wikitech.php can't be tested via mwdebug1001. Can you monitor the error logs for a while please? [13:09:14] yes [13:10:34] hm, not sure. It was generated by tox. The difference seems pretty significant. [13:10:47] I'll just revert it.
[13:10:51] thank you [13:10:58] !log urbanecm@deploy1002 Synchronized wmf-config/wikitech.php: 95f376a: wikitech: migrate wmf* to wmg* (T45956) (duration: 00m 48s) [13:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:02] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:11:09] it might be the commons file slightly changed, or different set of optimizers was used, not sure [13:11:35] (03PS2) 10Gergő Tisza: Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) [13:11:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298294)', diff saved to https://phabricator.wikimedia.org/P22418 and previous config saved to /var/cache/conftool/dbconfig/20220314-131220-marostegui.json [13:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:24] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:12:37] (03PS2) 10Gergő Tisza: Delete huwiki 500k milestone logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) [13:12:55] tgr_: are you still here? 
my client shows the tgr nick quit [13:13:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:13:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22419 and previous config saved to /var/cache/conftool/dbconfig/20220314-131344-marostegui.json [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:52] (03CR) 10Urbanecm: [C: 03+1] Stop using huwiki 500k milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza) [13:14:05] urbanecm: sorry, my bouncer is being difficult. [13:14:55] tgr_: no problem -- I can sync it w/o tests on your end if you want, as it's a pretty trivial change. 
[13:14:58] (CR) Urbanecm: [C: +2] Stop using huwiki 500k milestone logo [mediawiki-config] - https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) (owner: Gergő Tisza) [13:15:52] (Merged) jenkins-bot: Stop using huwiki 500k milestone logo [mediawiki-config] - https://gerrit.wikimedia.org/r/770496 (https://phabricator.wikimedia.org/T301923) (owner: Gergő Tisza) [13:16:45] works fine for me, syncing [13:17:07] (PS3) Urbanecm: Delete huwiki 500k milestone logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) (owner: Gergő Tisza) [13:17:11] (CR) Urbanecm: [C: +2] Delete huwiki 500k milestone logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) (owner: Gergő Tisza) [13:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:45] (CR) Andrew Bogott: [C: +2] P:toolforge::static: publish SSH fingerprints under /admin [puppet] - https://gerrit.wikimedia.org/r/766292 (owner: Majavah) [13:17:54] (Merged) jenkins-bot: Delete huwiki 500k milestone logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/770497 (https://phabricator.wikimedia.org/T301923) (owner: Gergő Tisza) [13:18:22] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 3c2c8b0cca4e48f572abd3812594097a33e64379: Stop using huwiki 500k milestone logo (T301923) (duration: 00m 48s) [13:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:25] T301923: Enable milestone logo for hu.wikipedia - 500K articles - https://phabricator.wikimedia.org/T301923 [13:20:28] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 3fa9683: Delete huwiki 500k milestone logo files (T301923) (duration: 00m 49s) [13:20:31] tgr_: should be all live!
[13:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:46] Thanks! I suppose pages have to age out of varnish for the change to take effect? [13:20:53] I see it on some pages but not all. [13:21:55] it should be much a shorter cache, it's just a different URI is in the CSS for background-image [13:22:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:46] (03CR) 10MMandere: [C: 03+1] "marc@stark:~/Projects/puppet/modules/varnish/files/tests$ cat /tmp/vtcresults.sLoTXC4t8E" [puppet] - 10https://gerrit.wikimedia.org/r/770474 (owner: 10Vgutierrez) [13:23:33] (03CR) 10Majavah: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/770102 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:23:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] !log restarting blazegraph on wdqs1006 (jvm stuck for 10hours) [13:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22420 and previous config saved to /var/cache/conftool/dbconfig/20220314-132726-marostegui.json [13:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22421 and previous config saved to /var/cache/conftool/dbconfig/20220314-132849-marostegui.json [13:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:02] urbanecm: we are well beyond the 5min ResourceLoader expiry now and the old logo is still present. In any case, just a curiosity, it's not a problem if it lingers for a while (delays when enabling the milestone logo would be worse, but I don't remember seeing that). Thanks again! [13:31:55] that's weird. i don't see it at all, but i also don't visit huwiki frequently, so that might well be it [13:32:02] (although now that I said that, I don't see it anymore. Maybe it was 10m?) [13:34:16] (03PS1) 10Gergő Tisza: Add a note about tox requirements for changing logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770502 [13:34:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:09] (03CR) 10Urbanecm: [C: 03+1] Add a note about tox requirements for changing logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770502 (owner: 10Gergő Tisza) [13:36:20] !log restarting swift-proxy on ms-fe100[5-8] to update config to know about new eqiad frontends T303698 [13:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:23] T303698: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698 [13:39:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:39:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:31] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22422 and previous config saved to /var/cache/conftool/dbconfig/20220314-134231-marostegui.json [13:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:03] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:43:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS b... 
[13:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298563)', diff saved to https://phabricator.wikimedia.org/P22423 and previous config saved to /var/cache/conftool/dbconfig/20220314-134356-marostegui.json [13:43:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:43:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:00] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [13:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:24] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [13:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:41] (03PS5) 10Herron: envoy: manage strip_matching_host_port setting and enable on thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) [13:45:28] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1009.eqiad.wmnet [13:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:40] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1009.eqiad.wmnet [13:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:49] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1009.eqiad.wmnet [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:57] !log mvernon@cumin1001 conftool 
action : set/pooled=yes; selector: service=nginx,name=ms-fe1009.eqiad.wmnet [13:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:55] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1012.eqiad.wmnet [13:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:12] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1012.eqiad.wmnet [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:18] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1012.eqiad.wmnet [13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:23] 10SRE, 10Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725 (10CDanis) [13:50:23] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1012.eqiad.wmnet [13:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:43] 10SRE, 10Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725 (10CDanis) p:05Triage→03Low [13:51:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ayounsi) I came across 3 planned parse servers in rack C8, https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=24&role=server As a reminder, C8 and D5 are dedica... 
[13:52:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:52:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:52] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1016.eqiad.wmnet with reason: host reimage [13:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:27] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1017.eqiad.wmnet with reason: host reimage [13:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:06] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:26] (03PS15) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) [13:56:48] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1011.eqiad.wmnet [13:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1011.eqiad.wmnet [13:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:02] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1011.eqiad.wmnet [13:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:08] 
!log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1011.eqiad.wmnet [13:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:13] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) @lmata yeah, sorry, that's been on my backlog but I had been putting it off... [13:57:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1016.eqiad.wmnet with reason: host reimage [13:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298294)', diff saved to https://phabricator.wikimedia.org/P22424 and previous config saved to /var/cache/conftool/dbconfig/20220314-135736-marostegui.json [13:57:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [13:57:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [13:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:40] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298294)', diff saved to https://phabricator.wikimedia.org/P22425 and previous config saved to /var/cache/conftool/dbconfig/20220314-135744-marostegui.json [13:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:58:31] !log grafana1002:~# systemctl restart grafana-ldap-users-sync.service T303064 [13:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:34] T303064: grafana-ldap-users-sync fails to finish intermittently - https://phabricator.wikimedia.org/T303064 [13:59:00] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34256/console" [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:59:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1017.eqiad.wmnet with reason: host reimage [13:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:42] (03PS1) 10JMeybohm: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) [14:01:27] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1010.eqiad.wmnet [14:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:33] !log mvernon@cumin1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1010.eqiad.wmnet [14:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:39] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1010.eqiad.wmnet [14:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:47] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1010.eqiad.wmnet [14:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:07] (03CR) 10Jelto: [V: 03+1] gitlab_runner: restrict docker traffic with additional ferm rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 
Jelto) [14:04:14] (CR) Volans: elastic: relax & restore perms during upgrade (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: Ryan Kemper) [14:05:48] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [14:06:38] (CR) Elukey: [C: +1] Adopt the new alerting API on all cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/770456 (owner: Volans) [14:07:09] (CR) Vgutierrez: [C: +2] varnish::tests: Basic X-Public-Cloud test [puppet] - https://gerrit.wikimedia.org/r/770474 (owner: Vgutierrez) [14:08:50] (PS1) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) [14:09:23] (CR) jerkins-bot: [V: -1] Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata) [14:10:08] (PS2) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) [14:10:39] (CR) jerkins-bot: [V: -1] Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata) [14:11:40] (PS3) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505
(https://phabricator.wikimedia.org/T303324) [14:17:55] (CR) Joal: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata) [14:20:15] (PS1) JMeybohm: Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966) [14:22:05] (PS2) JMeybohm: Remove LVS for miscweb [puppet] - https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) [14:22:39] (CR) Joal: [C: +1] "Adding a question to a question :)" [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx) [14:24:49] (PS4) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) [14:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298294)', diff saved to https://phabricator.wikimedia.org/P22426 and previous config saved to /var/cache/conftool/dbconfig/20220314-142502-marostegui.json [14:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:07] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [14:25:29] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34260/console" [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata) [14:27:15] (PS5) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) [14:27:53] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34261/console" [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata) [14:28:10] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:31:44] (CR) Herron: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34255/" [puppet] - https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: Herron) [14:32:03] (CR) Volans: "LGTM (CI apart) not sure if worth waiting the changes into spicerack at this point." [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:32:08] (PS2) Giuseppe Lavagetto: C:varnish: introduce the X-Abuse-Network request "header" [puppet] - https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471) [14:34:31] (CR) Giuseppe Lavagetto: [C: +2] C:varnish: introduce the X-Abuse-Network request "header" [puppet] - https://gerrit.wikimedia.org/r/769901 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [14:34:59] (CR) Ottomata: Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: Ottomata) [14:35:46] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (cmooney) @MatthewVernon Just to follow up having checked all network interfaces, forwarding tables and the end devices all looks to be working fine with ms-f... [14:38:15] (PS1) DCausse: [wdqs] adapt updateQueryServiceLag...
[puppet] - 10https://gerrit.wikimedia.org/r/770508 (https://phabricator.wikimedia.org/T302494) [14:39:55] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) A possibile way forward is to modify https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113 to avoid the profile::base::certificates profile, and modify the c... [14:40:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22427 and previous config saved to /var/cache/conftool/dbconfig/20220314-144007-marostegui.json [14:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10akosiaris) Replying instead of Daniel, he is currently unavailable. @Cmjohnson, I guess rows E & F are ok, I think it will be the first stuff we will be operating... [14:43:16] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:46] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:46:02] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) 05Open→03Resolved Great, thanks. 
I think we can close this now :) [14:47:27] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/770003 (owner: 10Jbond) [14:47:34] (03CR) 10Ottomata: Standardize the stats system user uid (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [14:47:42] (03PS3) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [14:47:48] 10SRE-swift-storage: Bring ms-fe10[09-12] into service - https://phabricator.wikimedia.org/T303698 (10MatthewVernon) 05Open→03Resolved All online OK. [14:50:19] 10SRE-swift-storage: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 (10MatthewVernon) [14:50:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [14:51:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [14:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298563)', diff saved to https://phabricator.wikimedia.org/P22428 and previous config saved to 
/var/cache/conftool/dbconfig/20220314-145109-marostegui.json [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:13] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [14:52:45] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) >>! In T300130#7774198, @elukey wrote: > A possibile way forward is to modify https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113 to avoid the profile::... [14:53:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:53:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300775)', diff saved to https://phabricator.wikimedia.org/P22429 and previous config saved to /var/cache/conftool/dbconfig/20220314-145345-marostegui.json [14:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:49] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [14:55:09] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1016.eqiad.wmnet with OS bullseye [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22430 and previous config saved to /var/cache/conftool/dbconfig/20220314-145512-marostegui.json [14:55:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS bulls... [14:57:12] (03PS1) 10David Caro: [buildservice] Add a cookbook to update the needed images [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) [14:57:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1017.eqiad.wmnet with OS bullseye [14:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:14] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) >>! In T300130#7774257, @colewhite wrote: >>>! In T300130#7774198, @elukey wrote: >> A possibile way forward is to modify https://gerrit.wikimedia.org/r/c/operatio... 
[15:01:09] (03CR) 10jerkins-bot: [V: 04-1] [buildservice] Add a cookbook to update the needed images [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [15:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298294)', diff saved to https://phabricator.wikimedia.org/P22431 and previous config saved to /var/cache/conftool/dbconfig/20220314-151017-marostegui.json [15:10:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [15:10:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [15:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:22] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [15:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298294)', diff saved to https://phabricator.wikimedia.org/P22432 and previous config saved to /var/cache/conftool/dbconfig/20220314-151025-marostegui.json [15:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:38] (03PS1) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) [15:15:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:15:14] 10SRE, 10ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (10Papaul) TICKET NO. 
2213827 U open with CY1 [15:15:42] (03CR) 10Muehlenhoff: Standardize the stats system user uid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:19:39] (03PS1) 104nn1l2: liwiktionary: Change timezone to CET/CEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770523 (https://phabricator.wikimedia.org/T303734) [15:24:32] (03PS4) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [15:24:39] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:24:39] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:24:41] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:24:56] WUT? XioNoX ^^^ [15:25:15] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:25:22] papaul: PDU? ^ [15:25:43] volans: I'd guess it's just the mgmt interface otherwise we would get much more noise [15:25:52] XioNoX: yes i am replacing the PDU [15:25:54] unless they are properly set in icinga [15:26:13] why different rows are in the same pdu? which pdu did fail? [15:26:49] PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:53] ps1-a1 [15:27:39] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:28:16] ah mr1, got it [15:28:57] PROBLEM - Host mr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T1530).
[15:30:11] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:30:15] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:30:37] (03PS5) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [15:31:13] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:31:17] (03CR) 10jerkins-bot: [V: 04-1] Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:32:38] (03PS6) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [15:33:10] (03CR) 10jerkins-bot: [V: 04-1] Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298563)', diff saved to https://phabricator.wikimedia.org/P22434 and previous config saved to /var/cache/conftool/dbconfig/20220314-153428-marostegui.json [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:33] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [15:34:44] (03PS2) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) [15:34:56] (03PS7) 10Ottomata: Standardize the stats 
system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [15:35:20] (03CR) 10jerkins-bot: [V: 04-1] Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [15:35:32] (03CR) 10jerkins-bot: [V: 04-1] Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:35:46] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34266/console" [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:36:28] (03PS2) 10Zabe: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) [15:36:38] (03CR) 10Muehlenhoff: Standardize the stats system user uid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:38:27] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:38:29] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:39:28] (03CR) 10Ottomata: [V: 03+1] Standardize the stats system user uid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:39:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298294)', diff saved to https://phabricator.wikimedia.org/P22435 and previous config saved to 
/var/cache/conftool/dbconfig/20220314-153945-marostegui.json [15:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:49] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [15:39:55] XioNoX: the new PDU is in place it should clear all the alarm now [15:40:10] nice! [15:40:11] (03PS3) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) [15:40:47] RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [15:40:53] (03CR) 10jerkins-bot: [V: 04-1] Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [15:42:28] (03PS4) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) [15:44:47] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 202 probes of 671 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:46:00] RECOVERY - Host kubestage2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.40 ms [15:46:01] RECOVERY - Host db2136.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.22 ms [15:46:01] RECOVERY - Host mc2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 32.00 ms [15:46:02] RECOVERY - Host ml-serve2005.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 42.85 ms [15:46:02] RECOVERY - Host re0.cr1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.52 ms [15:46:02] RECOVERY - Host scs-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [15:46:04] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms [15:46:04] RECOVERY - Host 
asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 40.79 ms [15:46:04] RECOVERY - Host es2026.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 45.37 ms [15:46:08] RECOVERY - Host db2075.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 46.78 ms [15:46:40] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [15:47:00] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [15:47:02] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [15:47:22] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:47:33] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34267/console" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [15:48:16] (03PS1) 10Klausman: Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 [15:48:24] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:48:30] RECOVERY - Juniper alarms on asw-a-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:49:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22436 and previous config saved to /var/cache/conftool/dbconfig/20220314-154933-marostegui.json [15:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:02] RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms [15:51:36] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 62 probes of 671 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:53:28] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) 05Open→03Resolved Per the last recommendation from @Joe at T301428#7730915, we've decided to pursue MySQL/Maria as the primary backe... [15:53:42] RECOVERY - IPMI Sensor Status on db2075 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:53:42] RECOVERY - IPMI Sensor Status on es2026 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:53:42] RECOVERY - IPMI Sensor Status on ml-serve2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:53:44] RECOVERY - IPMI Sensor Status on mc2019 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:54:10] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:54:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P22437 and previous config saved to /var/cache/conftool/dbconfig/20220314-155450-marostegui.json [15:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [15:55:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [15:56:54] (03CR) 10Jelto: [V: 03+1 C: 03+1] "looks good to me, minor suggestion in a comment" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [16:03:25] (03PS1) 10Ebernhardson: Cut saneitizer re-indexing rate in half [extensions/CirrusSearch] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770056 (https://phabricator.wikimedia.org/T302733) [16:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22438 and previous config saved to /var/cache/conftool/dbconfig/20220314-160438-marostegui.json [16:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:42] (03CR) 10Elukey: [C: 03+1] Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 (owner: 10Klausman) [16:09:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P22439 and previous config saved to /var/cache/conftool/dbconfig/20220314-160955-marostegui.json [16:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:56] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10Krinkle) [16:18:21] (03CR) 10JMeybohm: [C: 03+1] Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 (owner: 10Klausman) [16:19:44] (03CR) 10Klausman: [C: 03+2] Add DNS SRV records for ML staging etcd in codfw [dns] - 10https://gerrit.wikimedia.org/r/770529 (owner: 10Klausman) [16:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298563)', diff saved to https://phabricator.wikimedia.org/P22440 and previous config saved to /var/cache/conftool/dbconfig/20220314-161943-marostegui.json [16:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:19:48] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [16:19:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:19:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [16:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [16:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:24] (03CR) 10Volans: [C: 03+1] "LGTM, couple of questions/comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [16:21:40] PROBLEM - Check systemd state on deneb 
is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:38] RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.26 ms [16:24:54] (03CR) 10Ahmon Dancy: "Just a typo nit. Otherwise I think I'll be able to work with this." [puppet] - 10https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [16:25:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298294)', diff saved to https://phabricator.wikimedia.org/P22441 and previous config saved to /var/cache/conftool/dbconfig/20220314-162501-marostegui.json [16:25:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:25:05] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [16:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298294)', diff saved to https://phabricator.wikimedia.org/P22442 and previous config saved to /var/cache/conftool/dbconfig/20220314-162509-marostegui.json [16:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:56] (03PS1) 10David Caro: Refactor dologmsg [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) [16:26:03] (03PS1) 10David Caro: buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) [16:28:37] (03CR) 10jerkins-bot: [V: 04-1] Refactor dologmsg [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [16:28:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [16:28:46] (03PS2) 10Giuseppe Lavagetto: utils: add script to sync abuse networks with conftool ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) [16:28:48] (03PS1) 10Giuseppe Lavagetto: conftool-data: add phabricator_abusers to ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/770551 [16:28:52] (03CR) 10jerkins-bot: [V: 04-1] buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [16:29:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool-data: add phabricator_abusers to ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/770551 (owner: 10Giuseppe Lavagetto) [16:39:39] 10SRE, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:39:44] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:08] (03CR) 10Volans: "one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:46:17] (03PS1) 10Klausman: Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 [16:47:17] (03PS1) 10Andrew Bogott: netboot: switch cloudvirt102[1-9] partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/770555 
(https://phabricator.wikimedia.org/T281276) [16:48:20] 10SRE, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris) [16:48:50] (03CR) 10Elukey: [C: 03+1] Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 (owner: 10Klausman) [16:49:11] 10SRE, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris) 05Open→03Resolved a:03akosiaris Resolving. Wikifeeds has been migrated, restrouter migration was cancelled, the process is d... [16:49:17] (03CR) 10Klausman: [C: 03+2] Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 (owner: 10Klausman) [16:49:23] (03CR) 10Klausman: [V: 03+2 C: 03+2] Add dummy key for ML staging etcd in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/770554 (owner: 10Klausman) [16:49:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298294)', diff saved to https://phabricator.wikimedia.org/P22444 and previous config saved to /var/cache/conftool/dbconfig/20220314-164927-marostegui.json [16:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:32] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [16:50:15] (03CR) 10Andrew Bogott: [C: 03+2] netboot: switch cloudvirt102[1-9] partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/770555 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [16:52:00] (03CR) 10Bking: [C: 03+2] [wdqs] switch wdqs1010 to the streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/742670 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [16:52:47] (Device rebooted) firing: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - Device 
rebooted - https://alerts.wikimedia.org [17:00:05] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T1700). [17:00:38] (03PS2) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 [17:01:39] (03CR) 10jerkins-bot: [V: 04-1] C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [17:02:47] (Device rebooted) resolved: Device ps1-a1-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org [17:04:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P22445 and previous config saved to /var/cache/conftool/dbconfig/20220314-170432-marostegui.json [17:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:20] (03PS1) 10JMeybohm: Prevent allocation of nodePorts when ingress is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) [17:15:21] 10SRE, 10ops-codfw: codfw A1 power outage - https://phabricator.wikimedia.org/T303696 (10Papaul) 05Open→03Resolved Replaced the PDU with a spare one we had on site. 
[17:18:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:01] (03PS3) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 [17:19:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P22446 and previous config saved to /var/cache/conftool/dbconfig/20220314-171937-marostegui.json [17:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:50] (03CR) 10jerkins-bot: [V: 04-1] C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [17:23:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:43] 10SRE, 10ops-codfw, 10decommission-hardware: decommission cumin2001 - https://phabricator.wikimedia.org/T303399 (10Papaul) [17:24:04] 10SRE, 10ops-codfw, 10decommission-hardware: decommission cumin2001 - https://phabricator.wikimedia.org/T303399 (10Papaul) 05Open→03Resolved complete [17:30:15] (JobUnavailable) resolved: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:31:32] (03PS1) 10WMDE-Fisch: Fix copy-paste mistake in template search widget [extensions/TemplateWizard] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770057 (https://phabricator.wikimedia.org/T303524) [17:32:41] (03PS5) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) [17:34:17] (03PS4) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 
[17:34:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298294)', diff saved to https://phabricator.wikimedia.org/P22448 and previous config saved to /var/cache/conftool/dbconfig/20220314-173442-marostegui.json [17:34:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:34:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:47] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [17:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:43] (03CR) 10Vgutierrez: [C: 04-1] "this breaks varnish/text/31-blocked-nets.vtc" [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [17:40:32] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10JayCano) I just wanted to confirm that I approve of this request and I'm available for any questions. Thank you! 
[17:44:38] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@63af538] (eqiad): Enable 100% traffic mirroring on eqiad [17:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:42] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@63af538] (eqiad): Enable 100% traffic mirroring on eqiad (duration: 01m 04s) [17:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:46] !log start of foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=AUDIO --sleep 2 (T226311) [17:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:49] T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes - https://phabricator.wikimedia.org/T226311 [17:53:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22449 and previous config saved to /var/cache/conftool/dbconfig/20220314-175352-marostegui.json [17:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:59] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [17:54:13] (03CR) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [17:54:21] (03CR) 10RLazarus: [C: 03+1] utils: add script to sync abuse 
networks with conftool ipblocks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [17:54:24] (03CR) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [17:54:45] (03PS5) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 [18:00:25] (03PS6) 10Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - 10https://gerrit.wikimedia.org/r/769902 [18:03:55] (03PS1) 10Gerrit maintenance bot: Add guw to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/770565 (https://phabricator.wikimedia.org/T303727) [18:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [18:14:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1021.eqiad.wmnet with OS bullseye [18:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:36] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I just realized this approach doesn't work for abuse networks:" [puppet] - 10https://gerrit.wikimedia.org/r/769902 (owner: 10Giuseppe Lavagetto) [18:16:06] (03CR) 10Zabe: [C: 03+1] Add guw to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/770565 (https://phabricator.wikimedia.org/T303727) (owner: 10Gerrit maintenance bot) [18:17:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22450 and previous config saved to /var/cache/conftool/dbconfig/20220314-181709-marostegui.json [18:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:14] T298294: Make primary key 
filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [18:25:15] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1021.eqiad.wmnet with reason: host reimage [18:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:06] 10SRE-OnFire, 10DBA, 10Performance-Team (Radar), 10Sustainability (Incident Followup), 10Wikimedia-Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Krink... [18:28:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1021.eqiad.wmnet with reason: host reimage [18:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:44] 10SRE-OnFire, 10DBA, 10Performance-Team (Radar), 10Sustainability (Incident Followup), 10Wikimedia-Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Krink... 
[18:32:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22451 and previous config saved to /var/cache/conftool/dbconfig/20220314-183214-marostegui.json [18:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:28] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22452 and previous config saved to /var/cache/conftool/dbconfig/20220314-184719-marostegui.json [18:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:14] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1021.eqiad.wmnet with OS bullseye [18:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1022.eqiad.wmnet with OS bullseye [18:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300775)', diff saved to https://phabricator.wikimedia.org/P22453 and previous config saved to /var/cache/conftool/dbconfig/20220314-185849-marostegui.json [18:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:54] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [19:02:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298294)', diff saved to https://phabricator.wikimedia.org/P22454 and previous config saved to /var/cache/conftool/dbconfig/20220314-190224-marostegui.json [19:02:28] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:29] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [19:04:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1022.eqiad.wmnet with reason: host reimage [19:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:20] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:20] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1022.eqiad.wmnet with reason: host reimage [19:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:27] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 226.71 ms [19:11:28] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 244.25 ms [19:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22455 and previous config saved to /var/cache/conftool/dbconfig/20220314-191354-marostegui.json [19:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1022.eqiad.wmnet with OS bullseye [19:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22456 and previous config saved to /var/cache/conftool/dbconfig/20220314-192859-marostegui.json [19:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) 
rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) a:05nskaggs→03None [19:43:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) a:03RobH Thanks Arzhel! I don't believe anything else is needed from me. Assigning back to @RobH. Feel free to ping a... [19:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300775)', diff saved to https://phabricator.wikimedia.org/P22457 and previous config saved to /var/cache/conftool/dbconfig/20220314-194404-marostegui.json [19:44:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:44:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:09] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [19:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:32] (03PS2) 10Ssingh: certspotter: re-enable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) [19:47:00] (03CR) 10Ssingh: [V: 03+1 C: 04-1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34280/console" [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [19:47:13] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: re-enable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/770012 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [19:54:20] PROBLEM - Router 
interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:57:30] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:04] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T2000). [20:00:04] nn1l2 and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] I can deploy today [20:00:49] I don't see nn1l2. ebernhardson, do you want to start with your patch? [20:00:56] i actually have a meeting now, will deploy in 30 min [20:01:23] ebernhardson: okay, happy meeting then :). let's wait. [20:11:32] urbanecm: meeting done quickly :) Shipping now [20:11:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) So these will need to go into WMCS dedicated 10G racks, not in rows E/F, which have access to the public1 vlan. [20:11:54] ebernhardson: sure thing. Ping me when done (or if you need my help). 
[20:12:18] (03CR) 10Ebernhardson: [C: 03+2] "backport window" [extensions/CirrusSearch] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770056 (https://phabricator.wikimedia.org/T302733) (owner: 10Ebernhardson) [20:12:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) a:05RobH→03nskaggs [20:14:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) [20:14:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) a:05nskaggs→03Jclark-ctr [20:21:10] 10SRE, 10envoy, 10serviceops: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (10RLazarus) [20:22:01] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:22:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:06] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:56] 10SRE, 10Beta-Cluster-Infrastructure, 10envoy, 10serviceops: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (10RLazarus) [20:30:09] (03Merged) 10jenkins-bot: Cut saneitizer re-indexing rate in half [extensions/CirrusSearch] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770056 (https://phabricator.wikimedia.org/T302733) (owner: 10Ebernhardson) 
[20:30:45] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:00] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:07] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:51] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:35:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:57] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [20:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [20:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:34] is B&C still running? [20:45:17] nn1l2: i'm just pushing a patch now (gerrit always takes time to merge). I can do yours next [20:45:29] or if urbanecm is around they were going to i think [20:45:44] ebernhardson: up to you. You can do it or I can [20:45:44] !log ebernhardson@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CirrusSearch/profiles/SaneitizeProfiles.config.php: Backport: [[gerrit:770056|Cut saneitizer re-indexing rate in half (T302733)]] (duration: 00m 49s) [20:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:48] T302733: Restore CirrusSearch saneitizer to production usage - https://phabricator.wikimedia.org/T302733 [20:46:05] There have been too many changes in the schedule recently [20:46:21] urbanecm: you're more an expert than me these days, go ahead :) [20:46:26] ebernhardson: will do :) [20:46:34] mines complete [20:46:37] ack [20:46:54] nn1l2: the schedule is the same as it was the previous week. 
It's "just" pinned to PDT timezone, not UTC [20:47:29] (to be more precise, P(D/S)T) [20:47:30] there is no good way to handle DST unfortunately [20:47:46] and yeah, US and Europe starts DST at different times of the year [20:48:30] anyway, let's start [20:48:45] (03CR) 10Urbanecm: [C: 03+2] liwiktionary: Change timezone to CET/CEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770523 (https://phabricator.wikimedia.org/T303734) (owner: 104nn1l2) [20:49:31] (03Merged) 10jenkins-bot: liwiktionary: Change timezone to CET/CEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770523 (https://phabricator.wikimedia.org/T303734) (owner: 104nn1l2) [20:49:43] nn1l2: fyi there will be another shift for similar reasons in a week or two (when Europe gets to DST) [20:49:59] Thanks! [20:50:05] (but in the opposite direction) [20:50:36] nn1l2: pulled to mwdebug1001 [20:50:38] can you check? [20:50:43] ok [20:52:29] LGTM [20:52:43] syncing [20:53:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bca9c94c9d0bec83cb777bc474fde564c441349c: liwiktionary: Change timezone to CET/CEST (T303734) (duration: 00m 49s) [20:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:56] nn1l2: should be live! [20:53:57] T303734: Change time on li.wiktionary to local time zone - https://phabricator.wikimedia.org/T303734 [20:53:57] anything else [20:54:04] No, thanks! [20:54:38] no problem! 
[20:54:49] !log UTC late B&C completed [20:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:06] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:55:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:56:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:04] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T2100). 
[21:07:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1023.eqiad.wmnet with OS bullseye [21:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:14] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, gitlab-runner2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:09:14] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, gitlab-runner2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:10:31] (03PS1) 10Ssingh: certspotter: set send_mail_only_on_error to false [puppet] - 10https://gerrit.wikimedia.org/r/770600 [21:12:20] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34281/console" [puppet] - 10https://gerrit.wikimedia.org/r/770600 (owner: 10Ssingh) [21:14:57] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: set send_mail_only_on_error to false [puppet] - 10https://gerrit.wikimedia.org/r/770600 (owner: 10Ssingh) [21:23:31] Hey all - I'd like to deploy a quick security patch (perm check) for T160800 [21:23:43] (03CR) 10Bking: [C: 03+2] [wdqs] adapt updateQueryServiceLag... 
[puppet] - 10https://gerrit.wikimedia.org/r/770508 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse) [21:29:31] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:08] !log Deployed security fix for T160800 [21:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:53] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:34:41] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:35:43] (03PS1) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/770605 (https://phabricator.wikimedia.org/T301565) [21:36:08] !log bking@cumin pooling codfw in DNS-discovery for wdqs and wdqs-internal services [21:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:18] (03CR) 10jerkins-bot: [V: 04-1] karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/770605 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [21:36:33] RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:03] !log bking@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw [21:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:30] !log T302494 bking@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=codfw [21:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:35] T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol - https://phabricator.wikimedia.org/T302494 [21:39:07] (03PS1) 10Andrew Bogott: Update nic labels for cloudvirt1023/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770606 (https://phabricator.wikimedia.org/T281276) [21:39:28] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:32] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:39:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:11] (03CR) 10Andrew Bogott: [C: 03+2] Update nic labels for cloudvirt1023/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/770606 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [21:40:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:08] (03PS1) 10Ssingh: certspotter: more tuning: use OnUnitInactiveSec [puppet] - 10https://gerrit.wikimedia.org/r/770611 [21:55:08] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:56:04] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34282/console" [puppet] - 10https://gerrit.wikimedia.org/r/770611 (owner: 10Ssingh) [22:03:20] !log T302494 bking@puppetmaster1001 depooling eqiad in DNS-discovery for wdqs and wdqs-internal services [22:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:24] T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the 
native swift protocol - https://phabricator.wikimedia.org/T302494 [22:03:44] !log bking@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad [22:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:58] !log bking@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs-internal,name=eqiad [22:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [22:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [22:06:08] jouncebot: nowandnext [22:06:08] For the next 0 hour(s) and 53 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220314T2100) [22:06:09] In 2 hour(s) and 53 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T0100) [22:06:37] bumping all the remaining appservers and restbase machines to envoy 1.18 [22:06:51] no impact expected, the canaries were fine all weekend [22:16:35] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [22:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:14] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [22:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:22] (03PS1) 
10Ryan Kemper: wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 [22:21:05] (03PS2) 10Ryan Kemper: wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 [22:21:19] (03CR) 10Bking: [C: 03+1] wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 (owner: 10Ryan Kemper) [22:22:06] https://www.irccloud.com/pastebin/iFZnzWnM/ [22:22:15] oops... [22:22:17] Hi all...Should a person with the following user sting be able to access Wikipedia (given HTTPS HSTS)? [22:22:17] Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) [22:22:17] I'm thinking..yes? They are using a lower version of macOS than recommended on [22:22:17] https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations#For_users_of_Apple_macOS [22:22:17] but they are using the lastest version of Chrome... [22:23:09] .... Chrome/99.0.4844.51 Safari/537.36 [22:24:06] i.e. Chrome 99 on Mac OS X (El Capitan) [22:25:18] Josve05a: unfortunately this is a known issue affecting certain older clients, including anything running on OS X 10.11 and earlier -- https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry has details [22:26:58] Ah, I was pretty much up to date than, only that I thought (given info on https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations) that old Macs could still access Wikipedia if they had an updated compatible browser [22:27:08] but then I know, thanks! 
[22:27:13] then* [22:27:35] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:14] Josve05a: if upgrading macOS isn't an option then Firefox might work, but I think all the other major browsers depend on the OS for this [22:28:25] !log T301108 `ryankemper@cumin1001:~$ sudo cookbook sre.wdqs.data-transfer --source wdqs1009.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "moving away from legacy updater" --blazegraph_instance wikidata --without-lvs --task-id T301108` on tmux `wdqs` [22:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:29] T301108: Migrate wdqs1010 to the Flink based Streaming Updater and cleanup left over pieces of the old updater - https://phabricator.wikimedia.org/T301108 [22:28:43] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:43] rzl: Ah, thanks! [22:32:01] We need to update that page above and some VRT response templates it seems... [22:32:20] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:46] yeah - that firefox tip is just from glancing at letsencrypt documentation, I don't have a machine handy to test but I'll ask around to confirm [22:33:07] one way or the other I'll see about getting that page clarified [22:34:04] (from the edit history, it looks like the "If that is not possible, [...] 
consider installing an alternate secure browser" sentence is older than the LE issue, so it was probably a workaround for a previous Safari issue) [22:37:12] Yeah, I'll relay that information to the end-user and see if they can get Firefox to work or if they can somehow upgrade their OS [22:37:22] Thanks again [22:37:27] 👍 [22:44:15] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: more tuning: use OnUnitInactiveSec [puppet] - 10https://gerrit.wikimedia.org/r/770611 (owner: 10Ssingh) [23:08:49] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10AntiCompositeNumber) This is necessary mitigation for T265845. > Is issuing thousands of global blocks per day now an accepte... [23:20:17] (03PS1) 10Tim Starling: populateGlobalEditCount.php: skip lu_global_id=0 and add restart option [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770058 [23:20:29] (03CR) 10Tim Starling: [C: 03+2] populateGlobalEditCount.php: skip lu_global_id=0 and add restart option [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770058 (owner: 10Tim Starling) [23:22:42] (03Merged) 10jenkins-bot: populateGlobalEditCount.php: skip lu_global_id=0 and add restart option [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/770058 (owner: 10Tim Starling) [23:26:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:27:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:44:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22460 and previous config saved to /var/cache/conftool/dbconfig/20220314-234430-marostegui.json [23:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:34] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775