[00:17:39] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:23:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:27:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:38:01] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:44:17] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:37] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [00:44:43] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [00:55:55] (03PS1) 10Eevans: Merge Cassandra 3.11.13 configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/815822 (https://phabricator.wikimedia.org/T309896) [01:00:08] (03PS1) 10Ryan Kemper: elastic: prep to bring elastic20[64-72] in [puppet] - 10https://gerrit.wikimedia.org/r/815823 (https://phabricator.wikimedia.org/T300943) [01:15:27] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:11] (03CR) 10Eevans: "PPC output: https://puppet-compiler.wmflabs.org/pcc-worker1001/36336/" [puppet] - 10https://gerrit.wikimedia.org/r/815822 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [01:37:19] RECOVERY - 
Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:16] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) [02:48:33] 10SRE, 10MW-on-K8s, 10serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10tstarling) [02:48:41] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) 05Open→03Resolved [02:51:27] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: RFC: PHP 
microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) [03:10:17] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) Unfortunately `tasks_per_second` was only added in 2.27, and we're running 2.10. [05:05:07] (03PS1) 10KartikMistry: Enable Section Translation in Uzbek Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815829 (https://phabricator.wikimedia.org/T310116) [05:05:59] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:29] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:40] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T313383 [05:13:44] T313383: Switchover s7 master db1181 -> db1136 - https://phabricator.wikimedia.org/T313383 [05:13:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:13:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1136 with weight 0 T313383', diff saved to https://phabricator.wikimedia.org/P31559 and previous config saved to /var/cache/conftool/dbconfig/20220721-051358-root.json [05:14:09] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T313383 [05:14:39] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T313398 [05:14:43] T313398: Failover x1 master db1120 -> db1103 - 
https://phabricator.wikimedia.org/T313398 [05:14:59] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T313398 [05:15:57] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:17:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1103 with weight 0 T313398', diff saved to https://phabricator.wikimedia.org/P31560 and previous config saved to /var/cache/conftool/dbconfig/20220721-051752-root.json [05:21:23] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:27:18] (03PS3) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) [05:28:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [05:32:06] (03PS1) 10Marostegui: mariadb: Promote db1103 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/815830 (https://phabricator.wikimedia.org/T313398) [05:33:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:38:31] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:47:03] (ProbeDown) firing: 
Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:51:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Handle socket.timeout the same way as TimeoutError [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814893 (owner: 10Ahmon Dancy) [05:51:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Raise the default connection timeout to 2 seconds [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815747 (https://phabricator.wikimedia.org/T310835) (owner: 10Giuseppe Lavagetto) [05:51:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New version 0.0.3 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815748 (owner: 10Giuseppe Lavagetto) [05:52:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:05] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T0600). [06:00:12] o/ [06:00:20] hey [06:00:20] marostegui: I'm reviewing your patch [06:00:21] Starting [06:00:23] thanks! 
[06:00:29] !log Starting s7 eqiad failover from db1181 to db1136 - T313383 [06:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:33] T313383: Switchover s7 master db1181 -> db1136 - https://phabricator.wikimedia.org/T313383 [06:00:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T313383', diff saved to https://phabricator.wikimedia.org/P31561 and previous config saved to /var/cache/conftool/dbconfig/20220721-060037-marostegui.json [06:00:41] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1103 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/815830 (https://phabricator.wikimedia.org/T313398) (owner: 10Marostegui) [06:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1136 to s7 primary and set section read-write T313383', diff saved to https://phabricator.wikimedia.org/P31562 and previous config saved to /var/cache/conftool/dbconfig/20220721-060112-root.json [06:01:15] s7 switched [06:02:05] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover s7 master [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [06:04:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P31563 and previous config saved to /var/cache/conftool/dbconfig/20220721-060427-marostegui.json [06:06:18] (03PS1) 10Marostegui: wmnet: Update x1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/815831 (https://phabricator.wikimedia.org/T313398) [06:06:19] Amir1: Need a review of ^ too [06:06:36] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update x1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/815831 (https://phabricator.wikimedia.org/T313398) (owner: 10Marostegui) [06:06:40] done [06:07:00] let's do x1 master switch then? [06:07:21] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10ayounsi) 05Open→03Resolved Both are back to normal. 
Thanks! [06:07:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1103 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/815830 (https://phabricator.wikimedia.org/T313398) (owner: 10Marostegui) [06:08:29] Amir1: Ready for x1? [06:08:38] sure [06:08:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [06:08:56] !log Starting x1 eqiad failover from db1120 to db1103 - T313398 [06:08:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [06:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:00] T313398: Failover x1 master db1120 -> db1103 - https://phabricator.wikimedia.org/T313398 [06:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1103 to x1 primary and set section read-write T313398', diff saved to https://phabricator.wikimedia.org/P31564 and previous config saved to /var/cache/conftool/dbconfig/20220721-061001-root.json [06:10:10] x1 switched [06:10:34] (03CR) 10Marostegui: [C: 03+2] wmnet: Update x1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/815831 (https://phabricator.wikimedia.org/T313398) (owner: 10Marostegui) [06:10:48] Amir1: can you generate a write on x1? 
[06:10:54] sure [06:11:28] marostegui: works [06:11:32] (url shortener) [06:11:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T313398', diff saved to https://phabricator.wikimedia.org/P31565 and previous config saved to /var/cache/conftool/dbconfig/20220721-061145-root.json [06:12:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31566 and previous config saved to /var/cache/conftool/dbconfig/20220721-061217-root.json [06:12:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P31567 and previous config saved to /var/cache/conftool/dbconfig/20220721-061228-root.json [06:13:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) [06:13:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) [06:15:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) Both masters, s7 and x1 have been switched over and no longer live in this rack. 
[06:15:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2026.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [06:15:19] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [06:15:25] marostegui: Thank you <3 [06:15:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:15:39] Amir1: thanks for the help <3 <3 <3 [06:15:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2026.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [06:15:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:16:04] (03CR) 10Ayounsi: [C: 03+1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans) [06:17:50] marostegui: can I run a schema change on the s7 old master? [06:18:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2009.codfw.wmnet with OS bullseye [06:18:20] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS bullseye [06:23:12] Amir1: yes, I'm repooling it though [06:23:27] can you wait a bit? 
I'm going to get breakfast [06:23:32] let me know once it's done [06:23:33] sure [06:23:40] (03PS1) 10Tim Starling: Don't send debug log from test2wiki to testwiki.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815834 (https://phabricator.wikimedia.org/T279664) [06:24:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [06:24:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [06:24:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T312990)', diff saved to https://phabricator.wikimedia.org/P31568 and previous config saved to /var/cache/conftool/dbconfig/20220721-062431-marostegui.json [06:24:36] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [06:26:19] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [06:26:33] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:27:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31569 and previous config saved to /var/cache/conftool/dbconfig/20220721-062722-root.json [06:27:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P31570 and previous config saved to /var/cache/conftool/dbconfig/20220721-062733-root.json [06:27:56] (03CR) 10Krinkle: [C: 03+1] Don't send debug log from test2wiki to testwiki.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815834 (https://phabricator.wikimedia.org/T279664) (owner: 10Tim Starling) [06:34:17] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312990)', diff saved to https://phabricator.wikimedia.org/P31571 and previous config saved to /var/cache/conftool/dbconfig/20220721-063417-marostegui.json [06:34:23] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [06:34:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2009.codfw.wmnet with reason: host reimage [06:36:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686 [06:36:50] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [06:37:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686 [06:38:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2009.codfw.wmnet with reason: host reimage [06:39:30] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) p:05Triage→03Medium [06:41:16] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [06:41:22] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [06:41:30] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10ayounsi) [06:42:22] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [06:42:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 
(re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31572 and previous config saved to /var/cache/conftool/dbconfig/20220721-064226-root.json [06:42:28] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10ayounsi) [06:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P31573 and previous config saved to /var/cache/conftool/dbconfig/20220721-064237-root.json [06:46:15] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:47:07] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [06:47:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to plain disks, T311686 [06:47:16] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [06:47:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to plain disks, T311686 [06:49:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P31574 and previous config saved to /var/cache/conftool/dbconfig/20220721-064922-marostegui.json [06:52:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2009.codfw.wmnet with OS bullseye [06:52:54] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS bullseye completed: - 
ganeti2009 (**PASS**) - Downtimed on... [06:54:22] (03PS2) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikiquote vhost [puppet] - 10https://gerrit.wikimedia.org/r/815794 (https://phabricator.wikimedia.org/T273179) [06:54:28] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wwwportals: Make sure portal assets are also visible in wikiquote vhost [puppet] - 10https://gerrit.wikimedia.org/r/815794 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [06:55:19] (03PS2) 10Giuseppe Lavagetto: sre: add php busy workers alerts for parsoid, jobrunners [alerts] - 10https://gerrit.wikimedia.org/r/797313 [06:55:21] (03PS2) 10Giuseppe Lavagetto: sre: pretty-format mediawiki.yaml [alerts] - 10https://gerrit.wikimedia.org/r/797314 [06:55:23] (03PS2) 10Giuseppe Lavagetto: sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 [06:55:49] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [06:57:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31575 and previous config saved to /var/cache/conftool/dbconfig/20220721-065730-root.json [06:57:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS bullseye [06:57:39] (03CR) 10CI reject: [V: 04-1] sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [06:57:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P31576 and previous config saved to /var/cache/conftool/dbconfig/20220721-065741-root.json [06:57:42] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bullseye [06:57:44] (03CR) 10CI reject: [V: 04-1] sre: pretty-format mediawiki.yaml [alerts] - 10https://gerrit.wikimedia.org/r/797314 (owner: 10Giuseppe Lavagetto) [06:58:10] (03CR) 10CI reject: [V: 04-1] sre: add php busy workers alerts for parsoid, jobrunners [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [07:00:04] Amir1, apergos, jnuche, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T0700). [07:00:14] Good morning! no trainees are signed up today and there are no patches in the window. [07:00:55] this means that if anyone wants to self-deploy and has something to sneak in, they can add it to the calendar and speak up now [07:01:04] if not, I'll be wandering off in about 10 minutes as usual [07:04:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P31577 and previous config saved to /var/cache/conftool/dbconfig/20220721-070427-marostegui.json [07:08:54] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [07:10:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2020.codfw.wmnet to cluster codfw and group B [07:11:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2020.codfw.wmnet to cluster codfw and group B [07:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31578 and previous config saved to /var/cache/conftool/dbconfig/20220721-071234-root.json [07:12:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling 
@ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P31579 and previous config saved to /var/cache/conftool/dbconfig/20220721-071245-root.json [07:13:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2026.codfw.wmnet with reason: host reimage [07:14:28] (03CR) 10Filippo Giunchedi: [C: 03+2] phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:14:33] (03PS8) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [07:16:09] (03CR) 10Filippo Giunchedi: [V: 03+2] phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:16:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2026.codfw.wmnet with reason: host reimage [07:17:55] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [07:19:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312990)', diff saved to https://phabricator.wikimedia.org/P31580 and previous config saved to /var/cache/conftool/dbconfig/20220721-071932-marostegui.json [07:19:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:19:37] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [07:19:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:19:53] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db1179 (T312990)', diff saved to https://phabricator.wikimedia.org/P31581 and previous config saved to /var/cache/conftool/dbconfig/20220721-071953-marostegui.json [07:20:07] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815895 (https://phabricator.wikimedia.org/T273179) [07:20:23] given no takers for self deploy, I am wandering off, as advertised. see everyone next time! [07:20:39] (03CR) 10CI reject: [V: 04-1] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815895 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [07:21:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [07:22:34] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall !" [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [07:23:34] (03PS2) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815895 (https://phabricator.wikimedia.org/T273179) [07:23:43] (03CR) 10Filippo Giunchedi: [C: 03+2] mw_rc_irc: check ircd availability with blackbox prober (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:23:49] (03PS17) 10Filippo Giunchedi: mw_rc_irc: check ircd availability with blackbox prober [puppet] - 10https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) [07:26:15] (03PS3) 10Giuseppe Lavagetto: sre: add php busy workers alerts for parsoid, jobrunners [alerts] - 10https://gerrit.wikimedia.org/r/797313 [07:26:17] (03PS3) 10Giuseppe Lavagetto: sre: pretty-format mediawiki.yaml [alerts] - 10https://gerrit.wikimedia.org/r/797314 [07:26:19] (03PS3) 10Giuseppe Lavagetto: sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 [07:27:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: After maintenance', diff saved to 
https://phabricator.wikimedia.org/P31582 and previous config saved to /var/cache/conftool/dbconfig/20220721-072738-root.json [07:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P31583 and previous config saved to /var/cache/conftool/dbconfig/20220721-072749-root.json [07:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312990)', diff saved to https://phabricator.wikimedia.org/P31584 and previous config saved to /var/cache/conftool/dbconfig/20220721-072934-marostegui.json [07:29:38] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [07:30:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [07:30:20] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:30:20] (03CR) 10Ladsgroup: [C: 03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815895 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [07:31:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:31:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:31:11] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815895 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [07:31:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:31:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet 
with reason: Maintenance [07:31:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [07:31:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [07:31:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:32:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2026.codfw.wmnet with OS bullseye [07:32:05] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bullseye completed: - ganeti2026 (**PASS**) - Downtimed on... [07:32:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:32:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T312863)', diff saved to https://phabricator.wikimedia.org/P31585 and previous config saved to /var/cache/conftool/dbconfig/20220721-073217-ladsgroup.json [07:32:21] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [07:32:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:32:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:32:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre: add php busy workers alerts for parsoid, jobrunners [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [07:32:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T312863)', diff saved to 
https://phabricator.wikimedia.org/P31586 and previous config saved to /var/cache/conftool/dbconfig/20220721-073251-ladsgroup.json [07:33:03] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] sre: add php busy workers alerts for parsoid, jobrunners [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [07:33:05] (03PS1) 10Filippo Giunchedi: prometheus: default server_name to hostname in tcp check [puppet] - 10https://gerrit.wikimedia.org/r/815897 (https://phabricator.wikimedia.org/T305847) [07:34:29] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: default server_name to hostname in tcp check [puppet] - 10https://gerrit.wikimedia.org/r/815897 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:34:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:34:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T312863)', diff saved to https://phabricator.wikimedia.org/P31587 and previous config saved to /var/cache/conftool/dbconfig/20220721-073502-ladsgroup.json [07:35:30] (03Merged) 10jenkins-bot: sre: add php busy workers alerts for parsoid, jobrunners [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [07:37:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:38:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:38:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:39:13] looks good, syncing [07:39:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:42:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: After maintenance', diff saved 
to https://phabricator.wikimedia.org/P31588 and previous config saved to /var/cache/conftool/dbconfig/20220721-074242-root.json [07:42:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P31589 and previous config saved to /var/cache/conftool/dbconfig/20220721-074253-root.json [07:43:11] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) I have installed the new package also on db1111 (s8) and db1127 (s7), currently depooled. If all goes fine, I will pool... [07:43:23] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:815895|Adding Wikiquote to the new portals (T273179)]] (duration: 03m 08s) [07:43:26] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [07:44:25] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) [07:44:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31590 and previous config saved to /var/cache/conftool/dbconfig/20220721-074439-marostegui.json [07:46:33] !log ladsgroup@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:815895|Adding Wikiquote to the new portals (T273179)]] (duration: 03m 10s) [07:46:35] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:47:10] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:49:47] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:52:20] (03CR) 10Vgutierrez: P:varnish::common: Add support for passing wikimedia_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815728 (owner: 10Jbond) [07:54:33] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10MatthewVernon) @tstarling thanks for finding that history. That is pushing me harder towards "just disable the request replication", particularly given it looks lik... [07:55:53] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:57:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31591 and previous config saved to /var/cache/conftool/dbconfig/20220721-075745-root.json [07:57:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P31592 and previous config saved to /var/cache/conftool/dbconfig/20220721-075757-root.json [07:58:00] 10SRE, 10Security-Team, 10Traffic, 10SecTeam-Processed, 10Security: US Department of Homeland Security (DHS) IP blocks - https://phabricator.wikimedia.org/T303055 (10ayounsi) 05Open→03Resolved Thank you all, network block removed. 
[07:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31593 and previous config saved to /var/cache/conftool/dbconfig/20220721-075944-marostegui.json [07:59:56] (03PS1) 10Elukey: Add fake secrets for the ml revscoring-articlequality-topic k9s ns [labs/private] - 10https://gerrit.wikimedia.org/r/815900 (https://phabricator.wikimedia.org/T313307) [08:00:04] jeena and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T0800). [08:00:56] (03PS2) 10Elukey: Add fake secrets for the ml revscoring-articlequality-topic k9s ns [labs/private] - 10https://gerrit.wikimedia.org/r/815900 (https://phabricator.wikimedia.org/T313307) [08:01:55] (03PS3) 10Elukey: Add fake secrets for the ml revscoring-articletopic k8s ns [labs/private] - 10https://gerrit.wikimedia.org/r/815900 (https://phabricator.wikimedia.org/T313307) [08:02:49] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10fgiunchedi) Thank you indeed for the context @tstarling, +1 to disable the request replication @MatthewVernon [08:04:08] (03PS1) 10Elukey: profile::k8s::deployment_server: add ML revscoring-articletopic users [puppet] - 10https://gerrit.wikimedia.org/r/815903 (https://phabricator.wikimedia.org/T313307) [08:08:46] (03PS1) 10Elukey: helmfile.d: add configuration for the ML revscoring-articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/815905 (https://phabricator.wikimedia.org/T313307) [08:10:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) [08:14:50] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312990)', diff saved to https://phabricator.wikimedia.org/P31594 and previous config saved to /var/cache/conftool/dbconfig/20220721-081449-marostegui.json [08:14:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:14:54] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [08:15:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:18:19] !log installing containerd security updates in Kubernetes eqiad workers [08:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:38] (03CR) 10Klausman: [C: 03+1] Add fake secrets for the ml revscoring-articletopic k8s ns [labs/private] - 10https://gerrit.wikimedia.org/r/815900 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:19:55] (03PS1) 10Marostegui: db2169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/815906 (https://phabricator.wikimedia.org/T311493) [08:20:53] (03CR) 10Klausman: [C: 03+1] helmfile.d: add configuration for the ML revscoring-articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/815905 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:23:32] (03CR) 10Marostegui: [C: 03+2] db2169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/815906 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [08:23:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:24:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:24:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: 
Maintenance [08:24:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [08:27:10] (03PS1) 10Marostegui: instances.yaml: Add db2169 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815907 (https://phabricator.wikimedia.org/T311493) [08:29:36] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2169 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815907 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [08:30:58] (03CR) 10Klausman: [C: 03+1] profile::k8s::deployment_server: add ML revscoring-articletopic users [puppet] - 10https://gerrit.wikimedia.org/r/815903 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:31:00] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [08:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2169 to s6 and s7 T311493', diff saved to https://phabricator.wikimedia.org/P31595 and previous config saved to /var/cache/conftool/dbconfig/20220721-083147-marostegui.json [08:31:52] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [08:32:36] (03CR) 10Volans: [C: 03+2] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans) [08:33:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [08:34:14] (03CR) 10Klausman: [C: 03+2] Add fake secrets for the ml revscoring-articletopic k8s ns [labs/private] - 10https://gerrit.wikimedia.org/r/815900 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:34:50] (03CR) 10Klausman: [V: 03+2 C: 03+2] Add fake secrets for the ml revscoring-articletopic k8s ns [labs/private] - 10https://gerrit.wikimedia.org/r/815900 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:37:10] (03Merged) 
10jenkins-bot: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans) [08:40:38] (03CR) 10Klausman: [C: 03+2] helmfile.d: add configuration for the ML revscoring-articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/815905 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:40:41] (03PS1) 10Ayounsi: Netbox-next: Allow login from NDA users [puppet] - 10https://gerrit.wikimedia.org/r/815908 (https://phabricator.wikimedia.org/T302870) [08:41:01] (03CR) 10Klausman: [C: 03+2] profile::k8s::deployment_server: add ML revscoring-articletopic users [puppet] - 10https://gerrit.wikimedia.org/r/815903 (https://phabricator.wikimedia.org/T313307) (owner: 10Elukey) [08:41:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [08:44:30] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MoritzMuehlenhoff) >>! In T211661#8093119, @ori wrote: > Unfortunately `tasks_per_second` was only added in 2.27, and we're runni... [08:45:32] (03CR) 10Volans: "Couple of nits and a small bug inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [08:49:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [08:49:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [08:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T312990)', diff saved to https://phabricator.wikimedia.org/P31597 and previous config saved to /var/cache/conftool/dbconfig/20220721-084935-marostegui.json [08:49:39] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [08:50:54] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) Send the above patch to grant access (`is_active`). The permissions page though seems to involve quite a lot of manual work see fo... [08:52:54] (03CR) 10Volans: [C: 03+1] "I *think* this is ok, but would like John to also have a look." [puppet] - 10https://gerrit.wikimedia.org/r/815908 (https://phabricator.wikimedia.org/T302870) (owner: 10Ayounsi) [08:53:36] RECOVERY - k8s requests count to the API on ml-serve-ctrl2002 is OK: (C)100 ge (W)50 ge 37.68 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [08:53:51] !log klausman@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [08:54:09] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:54:19] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:54:34] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:54:44] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[08:54:50] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:54:54] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:55:34] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:57:21] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:59:29] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:59:40] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:00:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T312863)', diff saved to https://phabricator.wikimedia.org/P31598 and previous config saved to /var/cache/conftool/dbconfig/20220721-090022-ladsgroup.json [09:00:26] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [09:05:39] (HelmReleaseBadStatus) firing: Helm release kube-system/namespaces on k8s-staging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:10:32] \o I may have accidentally synced namespaces on staging-codfw [09:10:52] klausman: did you ^C out of it? [09:10:58] probably. [09:11:25] unfortunately, the window with the terminal output is gone. But the command is definitely in the bash history [09:11:57] Out of curiosity, does a ^C-ed sync actually break anything? [09:12:13] yeahno [09:12:22] :) [09:12:44] depends on the exact time you kill it I guess [09:12:50] Probably just "eh, got half a sync" warnings? 
[09:13:25] (03PS1) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [09:13:41] helm will update the state of the release/deployment in last step (client side). So you might end up with a release that is actually deployed correctly but lacks the state change [09:14:17] well, at least I only did it to staging :-S [09:14:22] you should be safe rolling back to the last deployed revision [09:14:30] (03CR) 10Jbond: "The original" [puppet] - 10https://gerrit.wikimedia.org/r/815728 (owner: 10Jbond) [09:14:38] (03PS4) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728 [09:14:42] (03CR) 10CI reject: [V: 04-1] prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [09:14:45] like described in the linked page above [09:15:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P31599 and previous config saved to /var/cache/conftool/dbconfig/20220721-091527-ladsgroup.json [09:15:35] I may have joined the channel too late to see that link [09:16:49] (03CR) 10Vgutierrez: [C: 03+1] P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728 (owner: 10Jbond) [09:17:34] (03CR) 10JMeybohm: [C: 03+2] k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:18:01] !log disable puppet on A:cp for gerrit:815728 [09:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:28] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) I had a quick look at demo.netbox.dev and created a test user there (you can try with user 
foobar/foobar, the DB is reset every day... [09:19:09] klausman: (HelmReleaseBadStatus) firing: Helm release kube-system/namespaces on k8s-staging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:19:27] (03CR) 10Jbond: [C: 03+2] P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728 (owner: 10Jbond) [09:20:24] lmk if you don't feel safe doing that - but you can't really break anything relevant as long as you work on staging-codfw :) [09:20:32] jayme: alright. [09:21:37] !log installing containerd security updates in Kubernetes eqiad masters [09:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:07] jayme: So the command I ran was `helmfile -e staging-codfw -l name=namespaces sync` [09:23:21] I don't think (i.e. the one in error) [09:23:30] oops, disregard the first half before ( [09:24:08] So I am not sure what $SERVICE or $RELEASE would be [09:24:46] try "helm -n kube-system history namespaces" [09:25:44] $SERVICE ultimately is the namespace which is the same as the service name for all non admin_ng deployments. [09:26:02] and "kube-system" for all admin_ng deployments [09:26:06] Alright. I think I want release 54, from Apr 27. [09:26:49] https://phabricator.wikimedia.org/P31600 [09:27:03] 👍 [09:27:09] (03Merged) 10jenkins-bot: k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:27:18] Alright. Sorry for the noise again [09:27:22] np [09:27:51] thanks for testing my alert :-p [09:30:11] Any time! 
;) [09:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P31601 and previous config saved to /var/cache/conftool/dbconfig/20220721-093032-ladsgroup.json [09:30:39] (HelmReleaseBadStatus) resolved: Helm release kube-system/namespaces on k8s-staging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:30:50] (03PS1) 10Kevin Bazira: ml-services: Add arwiki articletopic isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) [09:32:49] !log enable puppet on A:cp post gerrit:815728 [09:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] (03CR) 10CI reject: [V: 04-1] ml-services: Add arwiki articletopic isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [09:34:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:35:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:36:09] (03CR) 10Vgutierrez: C:varnish: improve error messaging for reload-vcl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815761 (owner: 10Jbond) [09:37:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:37:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T312984)', diff saved to https://phabricator.wikimedia.org/P31602 and previous config saved to 
/var/cache/conftool/dbconfig/20220721-093755-ladsgroup.json [09:38:01] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [09:42:50] (03PS2) 10Jbond: C:varnish: improve error messaging for reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/815761 [09:42:56] (03PS1) 10Marostegui: instances: Remove db2085 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815914 (https://phabricator.wikimedia.org/T313239) [09:43:13] (03CR) 10Jbond: C:varnish: improve error messaging for reload-vcl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815761 (owner: 10Jbond) [09:44:45] (03CR) 10Marostegui: [C: 03+2] instances: Remove db2085 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815914 (https://phabricator.wikimedia.org/T313239) (owner: 10Marostegui) [09:46:33] (03PS1) 10Vgutierrez: cloud: Enable UDS on varnish @ traffic-cache-atstext-buster [puppet] - 10https://gerrit.wikimedia.org/r/815915 [09:47:36] (03CR) 10Vgutierrez: [C: 03+2] cloud: Enable UDS on varnish @ traffic-cache-atstext-buster [puppet] - 10https://gerrit.wikimedia.org/r/815915 (owner: 10Vgutierrez) [09:49:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [09:52:19] (03PS2) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [09:53:15] (03PS1) 10Marostegui: instances.yaml: Remove db2086 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815916 (https://phabricator.wikimedia.org/T313482) [09:53:40] 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Aklapper) > The verification is comes to my WMF email but the user is my volunteer account. I'm not sure what "the user" is. Isn't that the username that you chose when r... 
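Note on the 09:10 to 09:30 exchange above: the advice given was to look at `helm -n kube-system history namespaces` and roll back to the last deployed revision. The sketch below shows one way that selection could be scripted. It is only an illustration, not the procedure from the linked runbook: the sample history is made up (only revision 54 and the pending-upgrade state come from the log; revisions 53 and 55 are hypothetical), the field names are assumed to match what `helm history -o json` prints, and the helper names are invented here. It builds the rollback command without running it.

```python
import json

# Hypothetical sample of `helm -n kube-system history namespaces -o json`.
# Only revision 54 and the pending-upgrade status are taken from the log;
# everything else here is invented for illustration.
SAMPLE_HISTORY = json.loads("""
[
  {"revision": 53, "status": "superseded", "description": "Upgrade complete"},
  {"revision": 54, "status": "deployed", "description": "Upgrade complete"},
  {"revision": 55, "status": "pending-upgrade", "description": "Preparing upgrade"}
]
""")


def last_deployed_revision(history):
    """Return the highest revision whose status is 'deployed', or None."""
    deployed = [entry["revision"] for entry in history
                if entry["status"] == "deployed"]
    return max(deployed) if deployed else None


def rollback_command(namespace, release, history):
    """Build (but do not execute) the helm rollback command line."""
    revision = last_deployed_revision(history)
    if revision is None:
        raise ValueError("no deployed revision to roll back to")
    return ["helm", "-n", namespace, "rollback", release, str(revision)]


if __name__ == "__main__":
    # Print the command for an operator to review before running it by hand.
    print(" ".join(rollback_command("kube-system", "namespaces", SAMPLE_HISTORY)))
```

Printing the command rather than executing it keeps a human in the loop, which seems sensible for a release that was left in pending-upgrade by an interrupted sync.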
[09:54:02] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2086 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815916 (https://phabricator.wikimedia.org/T313482) (owner: 10Marostegui) [09:54:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2085 and db2086 from dbctl [3~', diff saved to https://phabricator.wikimedia.org/P31603 and previous config saved to /var/cache/conftool/dbconfig/20220721-095439-marostegui.json [09:54:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312990)', diff saved to https://phabricator.wikimedia.org/P31604 and previous config saved to /var/cache/conftool/dbconfig/20220721-095446-marostegui.json [09:54:50] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [09:54:54] (03CR) 10Jbond: [C: 03+1] Netbox-next: Allow login from NDA users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815908 (https://phabricator.wikimedia.org/T302870) (owner: 10Ayounsi) [09:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T312863)', diff saved to https://phabricator.wikimedia.org/P31605 and previous config saved to /var/cache/conftool/dbconfig/20220721-095454-ladsgroup.json [09:54:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:54:58] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [09:55:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:56:05] (03CR) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [09:56:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 
ml-etcd2001.codfw.wmnet with reason: Switch instance to plain disk storage, T311686 [09:56:31] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [09:56:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to plain disk storage, T311686 [09:57:34] (03PS3) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [09:57:36] (03CR) 10David Caro: prometheus: Add icmp blackbox check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [10:00:05] mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1000). [10:00:05] (03PS45) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [10:00:59] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/815298 (owner: 10PipelineBot) [10:01:21] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/814170 (owner: 10PipelineBot) [10:01:31] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/814358 (owner: 10PipelineBot) [10:04:20] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:05:01] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/815298 (owner: 10PipelineBot) [10:05:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2009.codfw.wmnet to cluster codfw and group C [10:06:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host 
ganeti2009.codfw.wmnet to cluster codfw and group C [10:07:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I did a pass through the nignx configuration, and I think it's overall correct, I have a few questions inline, but I think it can even go " [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [10:09:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31606 and previous config saved to /var/cache/conftool/dbconfig/20220721-100951-marostegui.json [10:10:50] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:11:24] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:13:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312984)', diff saved to https://phabricator.wikimedia.org/P31607 and previous config saved to /var/cache/conftool/dbconfig/20220721-101341-ladsgroup.json [10:13:46] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:13:48] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:14:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2026.codfw.wmnet to cluster codfw and group D [10:14:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2026.codfw.wmnet to cluster codfw and group D [10:15:01] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:15:37] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:16:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [10:17:31] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply 
[10:18:11] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:19:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [10:24:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31608 and previous config saved to /var/cache/conftool/dbconfig/20220721-102457-marostegui.json [10:28:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P31609 and previous config saved to /var/cache/conftool/dbconfig/20220721-102846-ladsgroup.json [10:33:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2026.codfw.wmnet to cluster codfw and group D [10:34:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2026.codfw.wmnet to cluster codfw and group D [10:35:10] if any phab admin around, please see T313487 or -releng [10:39:37] (03CR) 10Klausman: ml-services: Add arwiki articletopic isvc to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [10:39:38] Amir1: can I bribe you into a very long winded dance with phab ACLs? 
[10:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312990)', diff saved to https://phabricator.wikimedia.org/P31610 and previous config saved to /var/cache/conftool/dbconfig/20220721-104002-marostegui.json [10:40:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [10:40:07] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [10:40:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [10:40:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:40:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:40:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T312990)', diff saved to https://phabricator.wikimedia.org/P31611 and previous config saved to /var/cache/conftool/dbconfig/20220721-104039-marostegui.json [10:43:30] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:43:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P31612 and previous config saved to /var/cache/conftool/dbconfig/20220721-104351-ladsgroup.json [10:45:56] (03CR) 10Jbond: [C: 03+1] prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [10:46:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on kubetcd2006.codfw.wmnet with reason: Switch to DRBD, T311686 [10:46:06] T311686: Upgrade ganeti/codfw to 
Bullseye - https://phabricator.wikimedia.org/T311686 [10:46:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubetcd2006.codfw.wmnet with reason: Switch to DRBD, T311686 [10:46:28] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [10:53:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre: pretty-format mediawiki.yaml [alerts] - 10https://gerrit.wikimedia.org/r/797314 (owner: 10Giuseppe Lavagetto) [10:54:51] (03Merged) 10jenkins-bot: sre: pretty-format mediawiki.yaml [alerts] - 10https://gerrit.wikimedia.org/r/797314 (owner: 10Giuseppe Lavagetto) [10:56:34] fyi i might run a little over my window [10:58:42] (03CR) 10Ayounsi: sre.network.debug: initial commit (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi) [10:58:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312984)', diff saved to https://phabricator.wikimedia.org/P31613 and previous config saved to /var/cache/conftool/dbconfig/20220721-105856-ladsgroup.json [10:59:03] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:59:04] (03CR) 10Volans: [C: 03+1] "LGTM to start testing things and then once settles start to move things from __init__ to spicerack." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [10:59:38] (03PS1) 10Mvolz: Fix package-lock for zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/815927 [11:01:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312990)', diff saved to https://phabricator.wikimedia.org/P31614 and previous config saved to /var/cache/conftool/dbconfig/20220721-110126-marostegui.json [11:01:30] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:03:02] (03PS1) 10Marostegui: mariadb: Decommission db2078 [puppet] - 10https://gerrit.wikimedia.org/r/815928 (https://phabricator.wikimedia.org/T312754) [11:03:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2078.codfw.wmnet [11:03:39] (03CR) 10Mvolz: [C: 03+2] Fix package-lock for zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/815927 (owner: 10Mvolz) [11:06:41] (03Merged) 10jenkins-bot: Fix package-lock for zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/815927 (owner: 10Mvolz) [11:07:33] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:07:38] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [11:07:54] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:08:10] (03CR) 10David Caro: wmcs: don't page for most checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [11:08:15] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:08:18] (03CR) 10David Caro: [C: 03+2] wmcs: don't page for most checks [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [11:08:55] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:09:19] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:10:00] !log mvolz@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/zotero: apply [11:11:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2078 [puppet] - 10https://gerrit.wikimedia.org/r/815928 (https://phabricator.wikimedia.org/T312754) (owner: 10Marostegui) [11:13:48] (03PS1) 10Marostegui: wmnet: Replace db2160 with db2078 [dns] - 10https://gerrit.wikimedia.org/r/815933 (https://phabricator.wikimedia.org/T311493) [11:14:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: Switch instance to plain disk storage, T311686 [11:14:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: Switch instance to plain disk storage, T311686 [11:14:14] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [11:14:15] (03PS2) 10Marostegui: wmnet: Replace db2078 with db2160 [dns] - 10https://gerrit.wikimedia.org/r/815933 (https://phabricator.wikimedia.org/T311493) [11:15:17] (03CR) 10Marostegui: [C: 03+2] wmnet: Replace db2078 with db2160 [dns] - 10https://gerrit.wikimedia.org/r/815933 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:16:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P31615 and previous config saved to /var/cache/conftool/dbconfig/20220721-111631-marostegui.json [11:16:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:16:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2078.codfw.wmnet [11:17:19] 10ops-codfw, 10decommission-hardware: decommission db2078 - https://phabricator.wikimedia.org/T312754 (10Marostegui) @Papaul this is ready for you [11:17:42] 10ops-codfw, 10decommission-hardware: decommission db2078 - https://phabricator.wikimedia.org/T312754 (10Marostegui) a:03Papaul [11:18:25] (03CR) 10CI reject: [V: 04-1] Localisation 
updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/815938 (owner: 10L10n-bot) [11:21:29] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:26:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/812818 (owner: 10Slyngshede) [11:30:16] (03PS2) 10Kevin Bazira: ml-services: Add arwiki articletopic isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) [11:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P31616 and previous config saved to /var/cache/conftool/dbconfig/20220721-113136-marostegui.json [11:31:45] (03CR) 10Kevin Bazira: ml-services: Add arwiki articletopic isvc to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [11:32:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:51] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_cirrus_build_completion_indices_codfw.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:00] (03CR) 10Klausman: [C: 03+2] ml-services: Add arwiki articletopic isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [11:46:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312990)', diff saved to https://phabricator.wikimedia.org/P31617 and previous config saved to /var/cache/conftool/dbconfig/20220721-114641-marostegui.json [11:46:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:46:48] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:46:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:47:29] (03Merged) 10jenkins-bot: ml-services: Add arwiki articletopic isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/815911 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [11:48:31] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:48:44] 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10Tarrow) I think from our (people keeping an eye on Wikibase releases) side it would be helpful to keep both 0.3.40 and 0.3.97. Other than these we won't be imp... 
[11:55:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:56:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:56:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T312990)', diff saved to https://phabricator.wikimedia.org/P31618 and previous config saved to /var/cache/conftool/dbconfig/20220721-115607-marostegui.json [11:56:11] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:58:07] (03CR) 10Jbond: [C: 03+2] sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [11:58:13] (03PS46) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [11:58:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:59:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:59:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 9 hosts with reason: Maintenance [11:59:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 9 hosts with reason: Maintenance [12:03:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 18:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:03:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 18:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:05:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312990)', diff saved to 
https://phabricator.wikimedia.org/P31619 and previous config saved to /var/cache/conftool/dbconfig/20220721-120553-marostegui.json [12:05:59] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:07:16] (03CR) 10Filippo Giunchedi: "Idea LGTM (see inline for comments), though I'm wondering if there are tcp/http services we can expect to be running at those addresses an" [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [12:07:21] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:11:17] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) What @MoritzMuehlenhoff said (though we'll be upgrading to 2.26). At any rate `object-expirer` will remove the actual... [12:17:25] (03PS4) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [12:17:27] (03CR) 10David Caro: prometheus: Add icmp blackbox check (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [12:18:44] (03CR) 10CI reject: [V: 04-1] prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [12:20:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P31620 and previous config saved to /var/cache/conftool/dbconfig/20220721-122058-marostegui.json [12:23:36] (03PS5) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [12:29:09] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! 
In T307184#8088084, @Jgiannelos wrote: > Eqiad is not serving live traffic at the moment. We need to re-import pl... [12:36:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P31621 and previous config saved to /var/cache/conftool/dbconfig/20220721-123603-marostegui.json [12:36:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:36:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:40:31] (03PS3) 10David Caro: wmcs: don't page for most checks [puppet] - 10https://gerrit.wikimedia.org/r/813267 [12:41:21] (03CR) 10David Caro: wmcs: don't page for most checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [12:49:49] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312990)', diff saved to https://phabricator.wikimedia.org/P31622 and previous config saved to /var/cache/conftool/dbconfig/20220721-125108-marostegui.json [12:51:13] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:53:23] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:56:07] (03CR) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Test Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [12:59:27] (03CR) 10David Caro: [C: 03+2] wmcs: don't page for most checks (031 
comment) [puppet] - https://gerrit.wikimedia.org/r/813267 (owner: David Caro) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1300). [13:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:38] I can deploy! [13:01:33] (PS2) Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) [13:02:54] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (fnegri) [13:05:17] (CR) Lucas Werkmeister (WMDE): [C: +2] Configure wbsearchentities profile parameter on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [13:05:59] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:04] (Merged) jenkins-bot: Configure wbsearchentities profile parameter on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [13:06:42] (CR) Filippo Giunchedi: "See inline" [puppet] - https://gerrit.wikimedia.org/r/815910 (owner: David Caro) [13:08:07] pulled to mwdebug1001, testing [13:09:12] 😬 [13:09:13] Caught
exception of type CirrusSearch\\Profile\\SearchProfileException [13:09:27] (CR) Giuseppe Lavagetto: [C: +2] php_exporter: only export the proper php version [puppet] - https://gerrit.wikimedia.org/r/815755 (owner: Giuseppe Lavagetto) [13:09:53] seems to happen regardless of which profile parameter I specify (or don’t specify) [13:10:01] * Lucas_WMDE looks at logstash [13:10:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:10:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31623 and previous config saved to /var/cache/conftool/dbconfig/20220721-131040-ladsgroup.json [13:10:44] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:13:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:14:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:41] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2.
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:14:43] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Cmjohnson) [13:14:43] I’ve found the mistake in my code [13:14:44] reverting [13:14:48] we can retry later [13:14:49] SRE, Infrastructure-Foundations, netops: Move asw2-d5-eqiad to spares - https://phabricator.wikimedia.org/T313115 (Cmjohnson) [13:14:52] but this will definitely need a backport [13:14:56] !log installing xen security updates [13:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:15:21] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (Cmjohnson) Open→Resolved Updated netbox [13:15:27] (PS1) Lucas Werkmeister (WMDE): Revert "Configure wbsearchentities profile parameter on Test Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/815969 (https://phabricator.wikimedia.org/T307869) [13:15:30] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "Configure wbsearchentities profile parameter on Test Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/815969 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [13:16:17] (Merged) jenkins-bot: Revert "Configure wbsearchentities profile parameter on Test Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/815969 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [13:17:41] revert pulled to mwdebug1001 [13:18:01] seems to be working again [13:18:08] \o/ [13:18:20] nothing to scap, I think, but I guess I should leave a log message anyways to avoid confusion [13:19:12] !log pulled config change Iee6de25983 to mwdebug1001, then reverted in I9248270621
and pulled that too; neither was synced to other hosts [13:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:21:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:21:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:21:19] !log installing paramiko security updates [13:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:22:28] if there’s nothing else going on, I’d like to test my fix for the issue on mwdebug1001 [13:26:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on db1123.eqiad.wmnet with reason: Maintenance [13:26:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1123.eqiad.wmnet with reason: Maintenance [13:26:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T312990)', diff saved to https://phabricator.wikimedia.org/P31624 and previous config saved to /var/cache/conftool/dbconfig/20220721-132639-marostegui.json [13:26:43] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:26:49] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:27:07] PROBLEM - Host cuminunpriv1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:27:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31625 and previous config saved to /var/cache/conftool/dbconfig/20220721-132746-root.json [13:28:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE
helmfile.d/services/mwdebug: apply [13:28:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:28:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:28:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:28:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T312990)', diff saved to https://phabricator.wikimedia.org/P31626 and previous config saved to /var/cache/conftool/dbconfig/20220721-132824-marostegui.json [13:28:45] PROBLEM - Host ganeti1020 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:28:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @papaul or @RobH I don't know what I am doing wrong with cloudnet1006, the installer fails fairly early in the process. There is a c... 
[13:31:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:33:20] (PS1) Alexandros Kosiaris: services_proxy: Move AF common stanza to separate template [puppet] - https://gerrit.wikimedia.org/r/815957 [13:33:22] (PS1) Alexandros Kosiaris: services_proxy: Allow having both v4 and v6 AF enabled [puppet] - https://gerrit.wikimedia.org/r/815958 [13:33:24] (PS1) Alexandros Kosiaris: services_proxy: Listen on :: and not ::1 [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) [13:36:13] (CR) CI reject: [V: -1] services_proxy: Move AF common stanza to separate template [puppet] - https://gerrit.wikimedia.org/r/815957 (owner: Alexandros Kosiaris) [13:36:32] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36337/console" [puppet] - https://gerrit.wikimedia.org/r/815957 (owner: Alexandros Kosiaris) [13:37:15] (CR) CI reject: [V: -1] services_proxy: Allow having both v4 and v6 AF enabled [puppet] - https://gerrit.wikimedia.org/r/815958 (owner: Alexandros Kosiaris) [13:37:46] (CR) CI reject: [V: -1] services_proxy: Listen on :: and not ::1 [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: Alexandros Kosiaris) [13:39:28] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312990)', diff saved to https://phabricator.wikimedia.org/P31627 and previous config saved to /var/cache/conftool/dbconfig/20220721-134008-marostegui.json [13:40:14] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:41:13]
alright, I’ll temporarily revert the revert of my bad config change, pull that to mwdebug1001, manually apply my fix (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/815961) on mwdebug1001, and then see if it works better [13:41:38] nobody else try to sync from deploy1002 during that time, please, we don’t want the bad config to go out everywhere ^^ [13:42:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1189.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1191.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1186.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1188.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1187.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1185.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1190.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1193.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host db1192.mgmt.eqiad.wmnet with reboot policy FORCED [13:42:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31628 and previous config saved to /var/cache/conftool/dbconfig/20220721-134250-root.json [13:43:58] okay, that seems to be working better [13:44:15] (03CR) 10Abijeet Patro: [V: 03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/815938 (owner: 10L10n-bot) [13:44:53] (03PS6) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [13:44:55] (03CR) 10David Caro: prometheus: Add icmp blackbox check (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [13:45:02] restored config to master and pulled mwdebug1001 again [13:45:05] I think I’m done :) [13:45:14] !log UTC afternoon backport+config window done [13:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:24] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:45:44] (03CR) 10CI reject: [V: 04-1] prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 (owner: 10David Caro) [13:47:34] (03PS1) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Test Wikidata (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815970 (https://phabricator.wikimedia.org/T307869) [13:48:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Don’t deploy until I0f9b688140 is actually live in production, either as part of the train (hopefully wmf.22) or backported." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/815970 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [13:50:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31629 and previous config saved to /var/cache/conftool/dbconfig/20220721-135008-ladsgroup.json [13:50:13] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:54:26] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:55:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31630 and previous config saved to /var/cache/conftool/dbconfig/20220721-135513-marostegui.json [13:57:02] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36338/console" [puppet] - 10https://gerrit.wikimedia.org/r/815958 (owner: 10Alexandros Kosiaris) [13:58:11] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36339/console" [puppet] - 10https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [13:58:36] (03PS1) 10Cmjohnson: Adding new db servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/815986 (https://phabricator.wikimedia.org/T306928) [13:58:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1195.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1189.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:03] !log cmjohnson@cumin1001 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1194.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1193.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1186.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1188.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1190.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1191.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1187.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1192.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1185.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:59:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [14:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T312863)', diff saved to https://phabricator.wikimedia.org/P31631 and previous config saved to /var/cache/conftool/dbconfig/20220721-140004-ladsgroup.json [14:00:09] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [14:00:35] RECOVERY - PHP opcache 
health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:02:09] (CR) Cmjohnson: [C: +2] Adding new db servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/815986 (https://phabricator.wikimedia.org/T306928) (owner: Cmjohnson) [14:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31632 and previous config saved to /var/cache/conftool/dbconfig/20220721-140434-root.json [14:05:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31633 and previous config saved to /var/cache/conftool/dbconfig/20220721-140513-ladsgroup.json [14:05:48] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/815822 (https://phabricator.wikimedia.org/T309896) (owner: Eevans) [14:08:07] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:29] (PS7) David Caro: prometheus: Add icmp blackbox check [puppet] - https://gerrit.wikimedia.org/r/815910 [14:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31634 and previous config saved to /var/cache/conftool/dbconfig/20220721-141018-marostegui.json [14:11:31] (PS1) Giuseppe Lavagetto: prometheus::php_fpm_exporter: allow absenting the class [puppet] - https://gerrit.wikimedia.org/r/815989 (https://phabricator.wikimedia.org/T313505) [14:11:33] (PS1) Giuseppe Lavagetto: prometheus::php_fpm_exporter: absent in production [puppet] - https://gerrit.wikimedia.org/r/815990 (https://phabricator.wikimedia.org/T313505) [14:11:35] (PS1) Giuseppe Lavagetto: prometheus::php_fpm_exporter: remove from puppet [puppet] - 
https://gerrit.wikimedia.org/r/815991 (https://phabricator.wikimedia.org/T313505) [14:11:37] (PS8) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [14:12:48] (CR) Eevans: Merge Cassandra 3.11.13 configuration changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/815822 (https://phabricator.wikimedia.org/T309896) (owner: Eevans) [14:13:28] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36343/console" [puppet] - https://gerrit.wikimedia.org/r/815989 (https://phabricator.wikimedia.org/T313505) (owner: Giuseppe Lavagetto) [14:14:00] (CR) Giuseppe Lavagetto: [V: +1 C: +2] prometheus::php_fpm_exporter: allow absenting the class [puppet] - https://gerrit.wikimedia.org/r/815989 (https://phabricator.wikimedia.org/T313505) (owner: Giuseppe Lavagetto) [14:14:52] (PS2) Ssingh: dnsdist: add CAP_BPF to systemd override for eBPF support [puppet] - https://gerrit.wikimedia.org/r/784272 [14:15:05] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36344/console" [puppet] - https://gerrit.wikimedia.org/r/815990 (https://phabricator.wikimedia.org/T313505) (owner: Giuseppe Lavagetto) [14:15:34] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36345/console" [puppet] - https://gerrit.wikimedia.org/r/784272 (owner: Ssingh) [14:16:33] (PS2) Alexandros Kosiaris: services_proxy: Move AF common stanza to separate template [puppet] - https://gerrit.wikimedia.org/r/815957 [14:16:35] (PS2) Alexandros Kosiaris: services_proxy: Allow having both v4 and v6 AF enabled [puppet] - https://gerrit.wikimedia.org/r/815958 [14:16:37] (PS2) Alexandros Kosiaris: 
services_proxy: Listen on :: and not ::1 [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) [14:17:26] (CR) David Caro: prometheus: Add icmp blackbox check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815910 (owner: David Caro) [14:17:40] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36346/console" [puppet] - https://gerrit.wikimedia.org/r/815991 (https://phabricator.wikimedia.org/T313505) (owner: Giuseppe Lavagetto) [14:17:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye [14:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1188.eqiad.wmnet with OS bullseye [14:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bullseye [14:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye [14:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS bullseye [14:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye [14:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye [14:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS bullseye [14:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye [14:17:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye [14:17:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye [14:18:02] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - 
db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1194.eqiad.wmnet with... [14:18:06] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1188.eqiad.wmnet with... [14:18:12] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1186.eqiad.wmnet with... [14:18:22] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1189.eqiad.wmnet with... [14:18:28] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1193.eqiad.wmnet with... [14:18:38] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1190.eqiad.wmnet with... 
[14:18:44] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1187.eqiad.wmnet with... [14:18:50] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1192.eqiad.wmnet with... [14:18:56] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1191.eqiad.wmnet with... [14:19:04] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36347/console" [puppet] - https://gerrit.wikimedia.org/r/815957 (owner: Alexandros Kosiaris) [14:19:06] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1185.eqiad.wmnet with... [14:19:12] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host db1195.eqiad.wmnet with... 
[14:19:36] (CR) CI reject: [V: -1] services_proxy: Move AF common stanza to separate template [puppet] - https://gerrit.wikimedia.org/r/815957 (owner: Alexandros Kosiaris) [14:19:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31635 and previous config saved to /var/cache/conftool/dbconfig/20220721-141938-root.json [14:20:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31636 and previous config saved to /var/cache/conftool/dbconfig/20220721-142019-ladsgroup.json [14:22:03] (CR) MVernon: [C: +2] "OK, agree we don't want to be carrying diff from upstream here." [puppet] - https://gerrit.wikimedia.org/r/815822 (https://phabricator.wikimedia.org/T309896) (owner: Eevans) [14:22:52] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1189.eqiad.wmnet with OS bullseye [14:22:55] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1193.eqiad.wmnet with OS bullseye [14:22:57] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1190.eqiad.wmnet with OS bullseye [14:22:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1186.eqiad.wmnet with OS bullseye [14:22:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1194.eqiad.wmnet with OS bullseye [14:22:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1185.eqiad.wmnet with OS bullseye [14:22:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1191.eqiad.wmnet with OS bullseye [14:22:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1188.eqiad.wmnet with OS bullseye [14:22:59] 
!log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1195.eqiad.wmnet with OS bullseye [14:22:59] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1189.eqiad.wmnet with OS... [14:23:00] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1187.eqiad.wmnet with OS bullseye [14:23:00] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1192.eqiad.wmnet with OS bullseye [14:23:02] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1193.eqiad.wmnet with OS... [14:23:05] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1190.eqiad.wmnet with OS... [14:23:11] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1186.eqiad.wmnet with OS... [14:23:15] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1194.eqiad.wmnet with OS... 
[14:23:21] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1185.eqiad.wmnet with OS... [14:23:27] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1191.eqiad.wmnet with OS... [14:23:33] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1195.eqiad.wmnet with OS... [14:23:39] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1188.eqiad.wmnet with OS... [14:23:45] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1187.eqiad.wmnet with OS... [14:23:51] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host db1192.eqiad.wmnet with OS... 
[14:24:19] (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (4 comments) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [14:25:15] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Cmjohnson) @Jclark-ctr These all failed because the cable is not connected, can you please verify that you connected the 1G... [14:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312990)', diff saved to https://phabricator.wikimedia.org/P31637 and previous config saved to /var/cache/conftool/dbconfig/20220721-142523-marostegui.json [14:25:28] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:27:00] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Cmjohnson) [14:28:54] (CR) David Caro: novafullstack: remove leaked VMs test, moved to alertmanager (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813275 (owner: David Caro) [14:29:24] (PS4) David Caro: novafullstack: remove leaked VMs test, moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/813275 [14:31:29] (PS3) David Caro: tests: Add nice message to runbook check test failure [alerts] - https://gerrit.wikimedia.org/r/815238 [14:34:13] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31638 and previous config saved to 
/var/cache/conftool/dbconfig/20220721-143524-ladsgroup.json [14:35:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:35:30] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:35:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31639 and previous config saved to /var/cache/conftool/dbconfig/20220721-143544-ladsgroup.json [14:36:19] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:36:47] (PS1) Volans: tools/ganeti-netbox-sync: increase timeout [software/netbox-extras] - https://gerrit.wikimedia.org/r/815997 [14:39:10] (CR) David Caro: tests: Add nice message to runbook check test failure [alerts] - https://gerrit.wikimedia.org/r/815238 (owner: David Caro) [14:39:15] (CR) David Caro: [C: +2] tests: Add nice message to runbook check test failure [alerts] - https://gerrit.wikimedia.org/r/815238 (owner: David Caro) [14:39:25] !log mvernon@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: merging upstream config changes T309896 - mvernon@cumin1001 [14:39:30] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [14:41:57] (Merged) jenkins-bot: tests: Add nice message to runbook check test failure [alerts] - https://gerrit.wikimedia.org/r/815238 (owner: David Caro) [14:42:52] (CR) Muehlenhoff: [C: +1] "Looks good" [software/netbox-extras] - https://gerrit.wikimedia.org/r/815997 (owner: Volans) [14:43:45] SRE, Infrastructure-Foundations: 
Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff) [14:45:01] (CR) Filippo Giunchedi: "See inline, one more comment/fix and I think this is good to go, thanks David!" [puppet] - https://gerrit.wikimedia.org/r/815910 (owner: David Caro) [14:45:15] !log upgrading ganeti/eqsin to 3.0.2 T312637 [14:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:19] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [14:47:23] (CR) Volans: [C: +2] tools/ganeti-netbox-sync: increase timeout [software/netbox-extras] - https://gerrit.wikimedia.org/r/815997 (owner: Volans) [14:48:30] (Merged) jenkins-bot: tools/ganeti-netbox-sync: increase timeout [software/netbox-extras] - https://gerrit.wikimedia.org/r/815997 (owner: Volans) [14:49:10] jouncebot: now [14:49:10] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [14:49:23] (PS1) Lucas Werkmeister (WMDE): Fix profile in wbsearchentities and wbsearch [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815983 (https://phabricator.wikimedia.org/T307869) [14:49:32] I’ll deploy ^ this backport and then retry my config change from earlier, if that’s alright with everyone [14:50:06] (CR) Lucas Werkmeister (WMDE): [C: +2] "backporting" [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815983 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [14:54:27] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:04] (CR) BryanDavis: rabbitmq: Add SPDX headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [14:56:55] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - 
https://phabricator.wikimedia.org/T313504 (nskaggs) +1, approve. [14:58:03] (PS1) RLazarus: varnish: If X-Requestctl is unset, don't append it to X-Analytics [puppet] - https://gerrit.wikimedia.org/r/816000 [15:01:18] (CR) RLazarus: "The alternative approach is to leave this line as is, but unset X-Requestctl if empty at the end of the requestctl flow -- in that case th" [puppet] - https://gerrit.wikimedia.org/r/816000 (owner: RLazarus) [15:02:15] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Papaul) @Cmjohnson if there is a current OS on it and was not finalized with puppet, try to re-run the cookbook with the --no-pxe --new flags. [15:03:45] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:03:46] (PS1) BryanDavis: rabbitmq: Fix SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816001 (https://phabricator.wikimedia.org/T308013) [15:05:14] (CR) Andrew Bogott: [C: +2] Put cloudweb100[34] into service [puppet] - https://gerrit.wikimedia.org/r/815378 (https://phabricator.wikimedia.org/T305414) (owner: Andrew Bogott) [15:05:21] (PS3) Andrew Bogott: Put cloudweb100[34] into service [puppet] - https://gerrit.wikimedia.org/r/815378 (https://phabricator.wikimedia.org/T305414) [15:08:25] (PS8) David Caro: prometheus: Add icmp blackbox check [puppet] - https://gerrit.wikimedia.org/r/815910 [15:08:36] (CR) David Caro: prometheus: Add icmp blackbox check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815910 (owner: David Caro) [15:09:07] (Merged) jenkins-bot: Fix profile in wbsearchentities and wbsearch [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815983 (https://phabricator.wikimedia.org/T307869) 
(owner: Lucas Werkmeister (WMDE)) [15:10:29] (Abandoned) David Caro: dumps.kiwix:Add pidfile to manage multiple runs [puppet] - https://gerrit.wikimedia.org/r/815310 (owner: David Caro) [15:10:32] quickly checking on mwdebug1001 (should be a no-op) [15:11:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2014.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:11:30] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [15:11:36] syncing [15:11:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2014.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:13:03] !log draining ganeti2021 T310483 [15:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:40] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/Wikibase/repo/: Backport: [[gerrit:815983|Fix profile in wbsearchentities and wbsearch (T307869)]] (duration: 03m 07s) [15:14:44] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [15:15:19] (CR) Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Test Wikidata (take 2) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/815970 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [15:15:21] (CR) Filippo Giunchedi: [C: +1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/815910 (owner: David Caro) [15:15:28] (CR) Lucas Werkmeister (WMDE): [C: +2] Configure wbsearchentities profile parameter on Test Wikidata (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/815970 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [15:15:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:16:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:16:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:16:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1003.wikimedia.org with OS buster [15:16:39] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS buster [15:16:47] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudweb1003.wikimedia.... [15:16:50] (Merged) jenkins-bot: Configure wbsearchentities profile parameter on Test Wikidata (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/815970 (https://phabricator.wikimedia.org/T307869) (owner: Lucas Werkmeister (WMDE)) [15:16:52] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudweb1004.wikimedia.... 
[15:16:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:17:06] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Andrew) I'm reimaging these with Buster because mediawiki isn't really supported on Bullseye yet. [15:17:19] pulled my config change to mwdebug1001, testing there [15:17:44] no errors this time \o/ [15:18:03] should be fine to sync in either order but I’ll do IS.php first [15:18:38] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (dcaro) [15:19:12] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (dcaro) [15:19:15] (and no change on real Wikidata, as expected) [15:19:48] (CR) Zabe: "Thanks for the catch" [puppet] - https://gerrit.wikimedia.org/r/816001 (https://phabricator.wikimedia.org/T308013) (owner: BryanDavis) [15:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31640 and previous config saved to /var/cache/conftool/dbconfig/20220721-152007-ladsgroup.json [15:20:12] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [15:21:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815970|Configure wbsearchentities profile parameter on Test Wikidata (take 2) (T307869)]] (1/2) (duration: 02m 59s) [15:21:26] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [15:21:29] (CR) David Caro: [C: +2] prometheus: Add icmp blackbox check [puppet] - https://gerrit.wikimedia.org/r/815910 (owner: 
David Caro) [15:22:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:22:06] SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Krinkle) [15:22:16] (CR) David Caro: [C: +2] prometheus: Add icmp blackbox check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815910 (owner: David Caro) [15:23:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:23:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:23:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:25:23] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikibase.php: Config: [[gerrit:815970|Configure wbsearchentities profile parameter on Test Wikidata (take 2) (T307869)]] (2/2) (duration: 03m 13s) [15:26:24] alright, I’m done deploying [15:29:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [15:29:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [15:29:53] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) The reason ratelimiting via `tasks_per_second` was introduced (per the [[ https://bugs.launchpad.net/swift/+bug/1784753 | bu... 
[15:32:04] (03PS1) 10Jbond: rabbitmq: correct SPDX licence tag [puppet] - 10https://gerrit.wikimedia.org/r/816004 [15:32:31] (03CR) 10Jbond: [C: 03+2] rabbitmq: correct SPDX licence tag [puppet] - 10https://gerrit.wikimedia.org/r/816004 (owner: 10Jbond) [15:33:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [15:33:15] (03CR) 10Jbond: rabbitmq: Add SPDX headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [15:34:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [15:35:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31641 and previous config saved to /var/cache/conftool/dbconfig/20220721-153512-ladsgroup.json [15:39:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312863)', diff saved to https://phabricator.wikimedia.org/P31642 and previous config saved to /var/cache/conftool/dbconfig/20220721-153904-ladsgroup.json [15:39:08] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:46:11] (03PS1) 10David Caro: wmcs: some yaml autoformatting [puppet] - 10https://gerrit.wikimedia.org/r/816005 [15:46:13] (03PS1) 10David Caro: ceph:osd: add support for multi-network setup [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) [15:48:53] (03CR) 10Ryan Kemper: [C: 03+2] elastic: prep to bring elastic20[64-72] in [puppet] - 10https://gerrit.wikimedia.org/r/815823 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper) [15:49:29] (03CR) 10CI reject: [V: 04-1] ceph:osd: add support for multi-network setup [puppet] - 10https://gerrit.wikimedia.org/r/816006 
(https://phabricator.wikimedia.org/T309209) (owner: 10David Caro) [15:50:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31643 and previous config saved to /var/cache/conftool/dbconfig/20220721-155017-ladsgroup.json [15:50:41] !log T300943 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/815823 and running puppet across elastic2* in preparation for adding new codfw hosts into service [15:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:45] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [15:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P31644 and previous config saved to /var/cache/conftool/dbconfig/20220721-155409-ladsgroup.json [15:58:03] (03CR) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [15:58:29] ACKNOWLEDGEMENT - MD RAID on cloudweb1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.150. Check system logs on 208.80.154.150 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T313520 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:58:40] 10SRE, 10ops-eqiad: Degraded RAID on cloudweb1003 - https://phabricator.wikimedia.org/T313520 (10ops-monitoring-bot) [16:00:05] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1600) [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] o/ [16:00:15] dancy: hello! I'm on time today :D looking [16:00:26] :-) [16:01:41] (03PS1) 10Ryan Kemper: elastic: enable elastic20[64-72] cirrus roles [puppet] - 10https://gerrit.wikimedia.org/r/816008 (https://phabricator.wikimedia.org/T300943) [16:01:51] dancy: any testing to do? or just roll it out everywhere and test it with the train later? [16:02:04] (03CR) 10RLazarus: [C: 03+2] scap.cfg.erb: Set gerrit_push_user: trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/815329 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [16:02:26] Just roll it out. We did manual testing of the change earlier in the week. [16:02:35] ah perfect 👍 all set, then! [16:03:04] Thanks! [16:05:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31645 and previous config saved to /var/cache/conftool/dbconfig/20220721-160522-ladsgroup.json [16:05:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:05:27] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [16:05:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:08:19] (03PS2) 10Ryan Kemper: elastic: enable elastic20[64-72] cirrus roles [puppet] - 10https://gerrit.wikimedia.org/r/816008 (https://phabricator.wikimedia.org/T300943) [16:08:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1003.wikimedia.org with OS buster [16:08:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudweb1003.wikimedia.org with OS buster complet... [16:09:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P31646 and previous config saved to /var/cache/conftool/dbconfig/20220721-160914-ladsgroup.json [16:09:51] !log T300943 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/816008 and running puppet twice on elastic20[64-72] [16:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:57] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [16:09:57] (03CR) 10Ryan Kemper: [C: 03+2] elastic: enable elastic20[64-72] cirrus roles [puppet] - 10https://gerrit.wikimedia.org/r/816008 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper) [16:10:16] (03PS1) 10Zabe: rabbitmq: fix version of MPL license in spdx header [puppet] - 10https://gerrit.wikimedia.org/r/816010 [16:12:42] (03CR) 10Zabe: rabbitmq: correct SPDX licence tag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816004 (owner: 10Jbond) [16:12:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1004.wikimedia.org with OS buster [16:13:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudweb1004.wikimedia.org with OS buster complet... 
[16:13:07] (03CR) 10Zabe: "was now fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/816004" [puppet] - 10https://gerrit.wikimedia.org/r/816001 (https://phabricator.wikimedia.org/T308013) (owner: 10BryanDavis) [16:17:56] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/816012 [16:19:57] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) >>! @jcrespo wrote in the task description: > * Should we implement a more deterministic sharding... [16:20:48] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/816012 (owner: 10Ahmon Dancy) [16:21:36] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/816012 (owner: 10Ahmon Dancy) [16:24:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312863)', diff saved to https://phabricator.wikimedia.org/P31647 and previous config saved to /var/cache/conftool/dbconfig/20220721-162419-ladsgroup.json [16:24:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [16:24:25] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [16:24:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [16:24:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:24:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:24:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T312863)', diff saved to https://phabricator.wikimedia.org/P31648 and previous config saved to /var/cache/conftool/dbconfig/20220721-162458-ladsgroup.json [16:26:57] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:26:58] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:27:11] here (and got it through VO that time) [16:28:45] o/ [16:29:41] PROBLEM - Check systemd state on elastic2066 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:42] PROBLEM - Check systemd state on elastic2071 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:09] PROBLEM - Check systemd state on elastic2065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:57] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:31:58] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:32:41] PROBLEM - Check systemd state on elastic2072 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:11] PROBLEM - Check systemd state on elastic2069 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:23] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:34:42] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) [16:35:23] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/814913 (https://phabricator.wikimedia.org/T309342) (owner: 10Andrew Bogott) [16:36:32] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) >>! In T133523#8095094, @Krinkle wrote: > source: etcd.php, for some reason) This is because this allows 100% hot es failovers, a... 
[16:36:45] PROBLEM - Check systemd state on elastic2067 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:55] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) [16:37:20] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) I explained on my previous comment why the labeling is not deterministic enough. [16:38:12] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS bullseye [16:38:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:38:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:38:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31649 and previous config saved to /var/cache/conftool/dbconfig/20220721-163859-ladsgroup.json [16:39:03] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [16:41:53] RECOVERY - Check systemd state on elastic2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:41] RECOVERY - Check systemd state on elastic2071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:25] RECOVERY - Check systemd state on elastic2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[16:44:28] !log ryankemper@cumin1001 conftool action : set/weight=10; selector: name=elastic6* [16:44:31] RECOVERY - Check systemd state on elastic2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:21] RECOVERY - Check systemd state on elastic2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:33] (03PS1) 10Brennen Bearnes: scap: target phab2001 for trial run [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816016 [16:52:06] (03CR) 10Dzahn: [C: 04-1] "it's in codfw. so phab2001.codfw.wmnet" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816016 (owner: 10Brennen Bearnes) [16:53:48] (03PS1) 10Ryan Kemper: elastic: add conftool entries for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/816017 (https://phabricator.wikimedia.org/T300943) [16:54:21] (03PS2) 10Brennen Bearnes: scap: target phab2001 for trial run [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816016 [16:54:43] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) [16:54:55] (03CR) 10Brennen Bearnes: scap: target phab2001 for trial run (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816016 (owner: 10Brennen Bearnes) [16:56:35] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) I have merged the "needs" for HA+sharding in a single bullet point so the language is clearer. Unless it is not clear, When a singl... 
[16:57:38] (03CR) 10Ryan Kemper: [C: 03+2] elastic: add conftool entries for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/816017 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper) [16:58:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs: merging upstream config changes T309896 - mvernon@cumin1001 [16:58:09] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [16:58:23] (03PS11) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) [16:58:25] (03PS1) 10Dduvall: jwt_authorizer: Provide microservice for JSON Web Token authorization [puppet] - 10https://gerrit.wikimedia.org/r/816018 (https://phabricator.wikimedia.org/T308501) [16:58:58] !log T300943 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/816017 to get conftool-data entries for new elastic2* hosts [16:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:03] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [17:00:02] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2066.codfw.wmnet with OS bullseye [17:00:05] bd808: May I have your attention please! Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1700) [17:01:34] (03CR) 10Dduvall: "Thanks again for the review. 
I've split the patch into two parts: 1) provides installation of jwt-authorizer and the jwt_authorizer::servi" [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:04:10] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::compute::service: don't add 'nova' user to libvirt group [puppet] - 10https://gerrit.wikimedia.org/r/814913 (https://phabricator.wikimedia.org/T309342) (owner: 10Andrew Bogott) [17:06:32] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-07-21-111825-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/816019 [17:10:33] (03CR) 10Bking: [C: 03+2] apifeatureusage: Write using the _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815781 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [17:11:58] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-07-21-111825-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/816019 (owner: 10BryanDavis) [17:15:18] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-07-21-111825-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/816019 (owner: 10BryanDavis) [17:17:05] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:17:30] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:17:43] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) >>! In T133523#8095228, @jcrespo wrote: > I explained on my previous comment why the labeling is not deterministic enough. I do no... 
[17:19:20] (03PS3) 10Ryan Kemper: apifeatureusage: Adjust index template to use _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815782 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [17:19:44] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:20:20] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:20:30] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:21:03] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:22:33] (03CR) 10Ryan Kemper: [C: 03+2] apifeatureusage: Adjust index template to use _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815782 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [17:29:58] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) [17:30:45] !log ryankemper@cumin1001 conftool action : set/weight=10,pooled=yes; selector: name=elastic6* [17:31:24] (03CR) 10Dzahn: [C: 03+1] scap: target phab2001 for trial run [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816016 (owner: 10Brennen Bearnes) [17:32:15] jouncebot nowandnext [17:32:15] For the next 0 hour(s) and 27 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1700) [17:32:15] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1800) [17:32:45] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) I discussed with @Krinkle on IRC, as I believe we had a missunderstanding- consistent hashing was 
implemented by @aaron at https://... [17:33:33] 10SRE, 10DBA, 10Epic, 10Patch-For-Review, and 2 others: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10jcrespo) [17:33:39] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:42] jouncebot: now [17:33:42] For the next 0 hour(s) and 26 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1700) [17:34:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31650 and previous config saved to /var/cache/conftool/dbconfig/20220721-173458-ladsgroup.json [17:35:03] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:35:26] !log ryankemper@cumin1001 conftool action : GET; selector: name=elastic6* [17:36:07] !log dancy@deploy1002 Synchronized README: Gathering timing info (duration: 03m 09s) [17:37:31] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10BCornwall) 05Open→03Resolved The instances have been replaced. [17:37:35] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Self-merging for dry run." 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816016 (owner: 10Brennen Bearnes) [17:41:17] !log ryankemper@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic206[1-9].* [17:41:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:49] !log ryankemper@cumin1001 conftool action : set/weight=10:pooled=no; selector: name=elastic2066.codfw.wmnet [17:41:59] !log ryankemper@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic207[0-2].* [17:44:52] 10SRE, 10Traffic: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 (10BCornwall) 05Open→03In progress [17:44:55] (03PS1) 10Ahmon Dancy: MWConfigCacheGenerator.php: Use grace period of 3 minutes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816022 (https://phabricator.wikimedia.org/T311788) [17:44:56] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BCornwall) [17:45:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:22] 10SRE, 10Traffic: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 (10BCornwall) p:05Triage→03Medium [17:46:50] (03PS2) 10David Caro: ceph:osd: add support for multi-network setup [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) [17:48:25] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36349/console" [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro) [17:49:36] (03CR) 10Dzahn: [V: 03+1] 
"https://puppet-compiler.wmflabs.org/pcc-worker1003/36348/mwdebug1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [17:50:03] (03CR) 10CI reject: [V: 04-1] ceph:osd: add support for multi-network setup [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro) [17:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31651 and previous config saved to /var/cache/conftool/dbconfig/20220721-175003-ladsgroup.json [17:50:24] (03CR) 10David Caro: [V: 03+1] "PCC looks good, 1034 had the routes manually added." [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro) [17:50:49] (03PS3) 10David Caro: ceph:osd: add support for multi-network setup [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) [17:51:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312863)', diff saved to https://phabricator.wikimedia.org/P31652 and previous config saved to /var/cache/conftool/dbconfig/20220721-175147-ladsgroup.json [17:51:54] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [17:53:03] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on elastic2048 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [18:00:04] jeena and jnuche: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1800) [18:01:03] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:40] (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816024 (https://phabricator.wikimedia.org/T308074) [18:01:42] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816024 (https://phabricator.wikimedia.org/T308074) (owner: 10TrainBranchBot) [18:02:07] (03PS1) 10Dzahn: httpbb: delete broken test for apple search bridge [puppet] - 10https://gerrit.wikimedia.org/r/816025 [18:02:13] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on elastic2048 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Ryan Kemper Trying reboot per https://wikitech.wikimedia.org/wiki/Microcode https://wikitech.wikimedia.org/wiki/Microcode [18:03:37] (03CR) 10CI reject: [V: 04-1] httpbb: delete broken test for apple search bridge [puppet] - 10https://gerrit.wikimedia.org/r/816025 (owner: 10Dzahn) [18:03:39] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816024 (https://phabricator.wikimedia.org/T308074) (owner: 10TrainBranchBot) [18:05:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31653 and previous config saved to /var/cache/conftool/dbconfig/20220721-180508-ladsgroup.json [18:05:43] (03CR) 10Majavah: [C: 04-1] "This test isn't intended to be run against appservers, and isn't even in the appservers/ directory?" 
[puppet] - 10https://gerrit.wikimedia.org/r/816025 (owner: 10Dzahn) [18:06:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P31654 and previous config saved to /var/cache/conftool/dbconfig/20220721-180653-ladsgroup.json [18:06:54] (03CR) 10Dzahn: "the reason I said that is this:" [puppet] - 10https://gerrit.wikimedia.org/r/816025 (owner: 10Dzahn) [18:07:12] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.21 refs T308074 [18:07:16] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [18:07:37] (03CR) 10Dzahn: httpbb: delete broken test for apple search bridge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816025 (owner: 10Dzahn) [18:09:24] (03Abandoned) 10Dzahn: httpbb: delete broken test for apple search bridge [puppet] - 10https://gerrit.wikimedia.org/r/816025 (owner: 10Dzahn) [18:10:28] (03CR) 10Krinkle: [C: 03+1] MWConfigCacheGenerator.php: Use grace period of 3 minutes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816022 (https://phabricator.wikimedia.org/T311788) (owner: 10Ahmon Dancy) [18:10:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:10:57] !log creating tables for board election with bv2022_tables.sql [18:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:20] (03PS2) 10Ahmon Dancy: MWConfigCacheGenerator.php: Use grace period of 3 minutes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816022 (https://phabricator.wikimedia.org/T311788) [18:11:37] !log brennen@deploy1002 Started deploy [phabricator/deployment@358bb3a]: (no justification provided) [18:11:39] (03CR) 10Ahmon Dancy: "Fixed a typo in the commit message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816022 (https://phabricator.wikimedia.org/T311788) (owner: 10Ahmon Dancy) [18:12:10] jouncebot nowandnext 
[18:12:10] For the next 1 hour(s) and 47 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T1800) [18:12:11] In 1 hour(s) and 47 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T2000) [18:12:27] PROBLEM - SSH on mw1324.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:12:54] !log brennen@deploy1002 Finished deploy [phabricator/deployment@358bb3a]: (no justification provided) (duration: 01m 17s) [18:14:01] !log testing scap deployment to phab2001, this is a no-op for production services [18:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:15:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:16:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:18:24] (03PS1) 10Majavah: P:mariadb::grants: add cloudweb1003/1004 grants [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) [18:19:29] (03PS2) 10Majavah: P:mariadb::grants: add cloudweb1003/1004 grants [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) [18:20:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31655 and previous config saved to /var/cache/conftool/dbconfig/20220721-182013-ladsgroup.json [18:20:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:20:18] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 
[18:20:20] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [18:20:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312984)', diff saved to https://phabricator.wikimedia.org/P31656 and previous config saved to /var/cache/conftool/dbconfig/20220721-182033-ladsgroup.json [18:21:49] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on elastic2048 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [18:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P31657 and previous config saved to /var/cache/conftool/dbconfig/20220721-182158-ladsgroup.json [18:21:59] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[18:22:07] (03PS1) 10Brennen Bearnes: scap: remove tag plugin & asciitable dependency [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816027 (https://phabricator.wikimedia.org/T313259) [18:23:01] (03CR) 10Andrew Bogott: [C: 03+1] P:mariadb::grants: add cloudweb1003/1004 grants [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) (owner: 10Majavah) [18:25:13] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:43] (03CR) 10Dduvall: [C: 04-1] "Sadly the CI_SERVER_TOKEN environment variable does not seem to work with `gitlab-runner run` despite it being part of the common gitlab-r" [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [18:27:17] (03PS12) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [18:29:25] (03PS1) 10BCornwall: geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) [18:29:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816022 (https://phabricator.wikimedia.org/T311788) (owner: 10Ahmon Dancy) [18:30:30] (03Merged) 10jenkins-bot: MWConfigCacheGenerator.php: Use grace period of 3 minutes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816022 (https://phabricator.wikimedia.org/T311788) (owner: 10Ahmon Dancy) [18:31:00] !log dancy@deploy1002 Started scap: Backport for [[gerrit:816022]] MWConfigCacheGenerator.php: Use grace period of 3 minutes [18:33:53] (03CR) 10CI reject: [V: 04-1] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [18:34:39] !log dancy@deploy1002 Finished scap: Backport for 
[[gerrit:816022]] MWConfigCacheGenerator.php: Use grace period of 3 minutes (duration: 03m 39s) [18:35:26] cjming and RhinosF1: Config backports should be reliable now. Please let me know if you find otherwise. [18:35:52] awesome - gtk thanks \o/ [18:36:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:36:51] (03PS2) 10BCornwall: geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) [18:37:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312863)', diff saved to https://phabricator.wikimedia.org/P31658 and previous config saved to /var/cache/conftool/dbconfig/20220721-183703-ladsgroup.json [18:37:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:37:07] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [18:37:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:37:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T312863)', diff saved to https://phabricator.wikimedia.org/P31659 and previous config saved to /var/cache/conftool/dbconfig/20220721-183723-ladsgroup.json [18:37:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:37:40] (03CR) 10BCornwall: "I went ahead and added Ethiopia even though there were no measurements (they were all 0) by copying over South Sudan's configs." 
[dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall) [18:38:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312984)', diff saved to https://phabricator.wikimedia.org/P31660 and previous config saved to /var/cache/conftool/dbconfig/20220721-184126-ladsgroup.json [18:41:30] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:41:31] 10SRE, 10SRE-OnFire, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10CDanis) I guess as another option we could host these things on `role::microsites::peopleweb` ? I'm guessing that's going to remain a tr... [18:42:15] !log running extensions/SecurePoll/cli/wm-scripts/bv2022/populateEditCount.php on all 8 sections [18:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:59] 10SRE, 10SRE-OnFire, 10serviceops, 10serviceops-collab, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10CDanis) [18:43:18] (03PS8) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [18:43:22] (03PS1) 10Krinkle: multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 [18:43:26] (03PS1) 10Krinkle: [WIP] multiversion: Fix reason for 'wikipedia' suffix not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816030 [18:44:45] (03CR) 10CI reject: [V: 04-1] multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/816029 (owner: 10Krinkle) [18:45:39] (03PS2) 10Krinkle: multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 [18:50:10] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS bullseye [18:51:02] (03PS13) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [18:51:41] (03PS10) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [18:51:45] (03PS14) 10Nskaggs: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 [18:56:02] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31661 and previous config saved to /var/cache/conftool/dbconfig/20220721-185631-ladsgroup.json [19:05:54] jouncebot next [19:05:54] In 0 hour(s) and 54 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T2000) [19:06:06] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2066.codfw.wmnet with reason: host reimage [19:06:07] bking@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [19:07:32] (03CR) 10BCornwall: "Want to get approval from other Traffic members just to make sure we're all comfortable with this switch." 
[puppet] - 10https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [19:09:49] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2066.codfw.wmnet with reason: host reimage [19:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31662 and previous config saved to /var/cache/conftool/dbconfig/20220721-191136-ladsgroup.json [19:12:37] RECOVERY - SSH on mw1324.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:20:06] (03CR) 10BCornwall: beaker: add initial beaker files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond) [19:20:55] (03PS1) 10Jgreen: add host frlog1002.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/816031 (https://phabricator.wikimedia.org/T306839) [19:24:49] (03CR) 10Nskaggs: Ensure quota_increase cookbook runs and validates (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [19:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312984)', diff saved to https://phabricator.wikimedia.org/P31663 and previous config saved to /var/cache/conftool/dbconfig/20220721-192641-ladsgroup.json [19:26:44] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[19:26:46] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [19:28:23] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:01] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:18] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Self-merging to continue testing. This is effectively a no-op, since tag.py is only intended to be invoked by scripts used in tagging a r" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/816027 (https://phabricator.wikimedia.org/T313259) (owner: 10Brennen Bearnes) [19:31:15] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2066.codfw.wmnet with OS bullseye [19:31:32] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Post-send work someimes fatals with "Errro: The UdpSocket to 127.0.0.1:10514 has been closed" (esp mwdebug hosts) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [19:31:40] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Post-send work sometimes fatals with "Errro: The UdpSocket to 127.0.0.1:10514 has been closed" (esp mwdebug hosts) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [19:31:46] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Post-send work sometimes fatals at "Error: The UdpSocket to 127.0.0.1:10514 has been closed" (esp. 
mwdebug hosts) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [19:31:56] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Post-send work sometimes fatals at "Error: The UdpSocket to 127.0.0.1:10514 has been closed" (esp. mwdebug hosts) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [19:34:13] !log brennen@deploy1002 Started deploy [phabricator/deployment@f962d0e]: (no justification provided) [19:34:18] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f962d0e]: (no justification provided) (duration: 00m 05s) [19:35:05] (03CR) 10Andrew Bogott: [C: 03+2] Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [19:35:41] !log brennen@deploy1002 Started deploy [phabricator/deployment@f962d0e]: (no justification provided) [19:35:46] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f962d0e]: (no justification provided) (duration: 00m 05s) [19:35:57] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:03] ACKNOWLEDGEMENT - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service Andrew Bogott work in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:03] ACKNOWLEDGEMENT - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service Andrew Bogott work in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312863)', diff saved to https://phabricator.wikimedia.org/P31664 and previous config saved to /var/cache/conftool/dbconfig/20220721-193756-ladsgroup.json [19:38:00] T312863: 
Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [19:43:57] (03Merged) 10jenkins-bot: Ensure quota_increase cookbook runs and validates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [19:50:03] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:08] (03CR) 10Jgreen: [C: 03+2] add host frlog1002.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/816031 (https://phabricator.wikimedia.org/T306839) (owner: 10Jgreen) [19:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P31665 and previous config saved to /var/cache/conftool/dbconfig/20220721-195301-ladsgroup.json [19:54:34] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [19:54:38] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [19:56:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [20:00:05] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220721T2000). [20:00:05] ebernhardson and cjming: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:30] o/ [20:00:53] o/ [20:01:46] o/ [20:02:03] (03CR) 10Clare Ming: [C: 03+2] Revert "cirrus: Dont recycle completion suggester indices" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814909 (owner: 10Ebernhardson) [20:06:03] 10SRE-Access-Requests, 10Release-Engineering-Team: Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (10dancy) [20:07:02] (03PS1) 10Ahmon Dancy: Add dancy to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) [20:07:32] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Nova: Allow duplicate VM names in different projects. [puppet] - 10https://gerrit.wikimedia.org/r/815787 (https://phabricator.wikimedia.org/T305831) (owner: 10Andrew Bogott) [20:08:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P31666 and previous config saved to /var/cache/conftool/dbconfig/20220721-200806-ladsgroup.json [20:08:24] 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (10dancy) [20:08:53] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Post-send work sometimes fatals at "Error: The UdpSocket to 127.0.0.1:10514 has been closed" (esp. mwdebug hosts) - https://phabricator.wikimedia.org/T214734 (10Krinkle) >>! In T214734#7838777, @Krinkle wrote: > M... [20:09:47] ebernhardson: brennen and i are trying to grok why your patch hasn't merged yet [20:10:11] 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (10dancy) Noting that my manager @thcipriani is on vacation right now. 
[20:10:47] cjming: sometimes gerrit wants you to click the rebase button even when it doesn't seem necessary [20:10:54] cjming: usually i rebase, remove the +2, then re-apply +2 and it merges [20:11:30] (also i'm not entirely sure thats whats happening here, but it's worked in the past :) [20:12:08] (03PS2) 10Clare Ming: Revert "cirrus: Dont recycle completion suggester indices" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814909 (owner: 10Ebernhardson) [20:13:24] (03CR) 10Clare Ming: [C: 03+2] Revert "cirrus: Dont recycle completion suggester indices" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814909 (owner: 10Ebernhardson) [20:13:41] ebernhardson: you're right! [20:14:24] damn computers [20:14:56] workflow automation: the most inconvenient approach except for all the others [20:15:30] (03Merged) 10jenkins-bot: Revert "cirrus: Dont recycle completion suggester indices" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814909 (owner: 10Ebernhardson) [20:15:37] Least worst? 
[20:16:06] ebernhardson: up on mwdebug1002 if it's testable [20:17:04] (03PS6) 10Clare Ming: Deploy grid to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:17:15] cjming: works as expected, recycled the testwiki index [20:17:23] cool - syncing [20:19:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Jgreen) 05Resolved→03Open [20:19:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Jgreen) [20:19:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:02] (03CR) 10Hashar: [C: 03+1] Add dancy to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) (owner: 10Ahmon Dancy) [20:20:27] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814909|Revert "cirrus: Dont recycle completion suggester indices"]] (duration: 02m 56s) [20:20:28] cjming@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [20:20:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:20:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:55] ebernhardson: mind verifying on prod? hopefully sync issues are all resolved [20:20:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Jgreen) @cmjohnson it looks like this host may have ended up in the frack-fundraising vlan (probably because I had it noted confusingly o... 
[20:21:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:22:31] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312863)', diff saved to https://phabricator.wikimedia.org/P31667 and previous config saved to /var/cache/conftool/dbconfig/20220721-202311-ladsgroup.json [20:23:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [20:23:16] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:23:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [20:23:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:23:29] cjming: I'm lingering just in case. 
[20:23:38] cjming: sure, sec [20:23:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:23:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T312863)', diff saved to https://phabricator.wikimedia.org/P31668 and previous config saved to /var/cache/conftool/dbconfig/20220721-202348-ladsgroup.json [20:23:57] 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (10hashar) [20:24:11] cjming: recycles correctly from mwmaint1002 as well [20:24:16] woohoo [20:24:32] (03CR) 10Clare Ming: [C: 03+2] Deploy grid to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:24:41] cjming: i think the !log bot complaining is because wikitech has had some db lag recently (is wikitech still the source-of-truth for !log commands?) [20:24:48] the db lag occasionally rejects edits [20:26:19] andrewbogott: maybe related to the work you were doing? ^ [20:26:42] ebernhardson: https://wikitech.wikimedia.org/wiki/Server_Admin_Log is still canonical, yeah [20:27:20] (03Merged) 10jenkins-bot: Deploy grid to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:27:25] Hm... possible? 
[20:27:52] (03CR) 10Brennen Bearnes: [C: 03+1] Add dancy to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) (owner: 10Ahmon Dancy) [20:27:54] i don't have any strong proof, but wikitech rejected my first two attempts to add an item to the deployment calendar earlier today complaining about db lag [20:28:02] !log testing the log by logging a test [20:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:21] ebernhardson: I definitely was messing with it briefly but it 'should' all be back to normal now [20:28:43] ack [20:28:45] Are you seeing bad behavior now/in the last couple hours? [20:29:20] andrewbogott: the bot rejected a !log command a couple minutes ago: 13:20:28 +stashbot | cjming@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [20:29:29] 9 minutes ago [20:29:38] ok [20:30:55] (03CR) 10Dzahn: [C: 03+1] Add dancy to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/816035 (https://phabricator.wikimedia.org/T313551) (owner: 10Ahmon Dancy) [20:31:25] there's kind of a wide variety of errors in that log :/ [20:31:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:02] !log cjming@deploy1002 Synchronized wmf-config: Config: [[gerrit:814907|Deploy grid to all wikis (T312241)]] (duration: 03m 13s) [20:32:04] cjming@deploy1002: Failed to log message to wiki. Somebody should check the error logs. 
[20:32:05] T312241: Deploy the new grid layout - https://phabricator.wikimedia.org/T312241 [20:32:15] happened again ^^ [20:32:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:33:01] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:05] !log Proof of life for stashbot processing !logs [20:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:12] andrewbogott: ^ [20:34:33] !log end of UTC late backport window [20:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (10brennen) @kchapman could you sign off as skip-level manager? [20:35:34] cjming, ebernhardson, bd808 is telling me that stashbot doesn't confirm logmsgbot messages, so that's intended behavior. [20:35:44] (Of course the failure message is a different story) [20:36:32] the "Failed to log message to wiki." were likely legit while wikitech was on the new hosts and had bad db grants [20:37:28] but not ack'ing !log from logmsgbot is currently by design. we turned that off last week in an attempt at reducing some bot spam here. 
[20:37:35] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:39:04] !log disabling puppet on mw appservers to deploy gerrit:809324 - T310738 [20:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:09] T310738: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 [20:39:44] (03PS1) 10Hashar: gerrit: $gerrit_servers > $ssh_allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/816038 [20:40:06] (03CR) 10Hashar: gerrit: add gerrit2002 to firewall rules for cluster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:40:26] (03PS6) 10Mary Yang: DO-NOT-SUBMIT(Under review and discussion): Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 [20:40:28] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/816038 (owner: 10Hashar) [20:43:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [20:44:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [20:44:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [20:44:58] !log dancy@deploy1002 backport aborted: (duration: 00m 02s) [20:44:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:45:02] dancy@deploy1002: Failed to log message to wiki. Somebody should check the error logs. 
[20:45:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31669 and previous config saved to /var/cache/conftool/dbconfig/20220721-204518-ladsgroup.json [20:45:22] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:45:54] (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under review and discussion): Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (owner: 10Mary Yang) [20:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T312863)', diff saved to https://phabricator.wikimedia.org/P31670 and previous config saved to /var/cache/conftool/dbconfig/20220721-204721-ladsgroup.json [20:47:24] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [20:47:25] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:52:38] !log deploying apache config change on cluster, slowly..puppet disabled on C:profile::mediawiki::httpd .. then re-enabling starting with mwdebug.. using httpbb to test it.. 
then re-enabling puppet on more hosts https://gerrit.wikimedia.org/r/c/operations/puppet/+/809324 Bug: T310738 [20:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:43] T310738: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 [20:53:25] (03CR) 10Hashar: [C: 03+1] gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:53:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "[deploy1002:~] $ for tests in foundation main redirects remnant secure wikimania_wikimedia wwwportals; do httpbb /srv/deployment/httpbb-te" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [20:59:47] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:37] ^ not really expected to get that alert but it does make sense because I merged my change [21:00:56] where the tests have been upgraded and the actual servers are being re-enabled over time [21:01:18] I am so far testing this on mwdebug* , multiple of them and it does what it should do [21:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P31671 and previous config saved to /var/cache/conftool/dbconfig/20220721-210226-ladsgroup.json [21:02:28] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[21:03:19] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:04:05] ACKNOWLEDGEMENT - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service daniel_zahn deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/809324 duplicate check https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:05] ACKNOWLEDGEMENT - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver daniel_zahn deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/809324 duplicate check https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:17:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P31672 and previous config saved to /var/cache/conftool/dbconfig/20220721-211732-ladsgroup.json [21:17:56] !log puppet re-enabled on mw-api-canary and parsoid-canary [21:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:08] (03CR) 10Hashar: Send events to Wikimedia EventGate (031 comment) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [21:23:15] (03PS1) 10Southparkfan: rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) [21:24:39] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [21:29:41] (03PS2) 10Southparkfan: rsyslog: allow specifying TLS client auth settings and 
filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) [21:30:58] (03PS3) 10Southparkfan: rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) [21:31:34] (03PS4) 10Southparkfan: rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) [21:32:28] (03PS2) 10Clare Ming: Remove Table of Contents config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810405 (https://phabricator.wikimedia.org/T310527) [21:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T312863)', diff saved to https://phabricator.wikimedia.org/P31673 and previous config saved to /var/cache/conftool/dbconfig/20220721-213237-ladsgroup.json [21:32:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:32:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:32:42] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T312863)', diff saved to https://phabricator.wikimedia.org/P31674 and previous config saved to /var/cache/conftool/dbconfig/20220721-213246-ladsgroup.json [21:38:57] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31675 and previous config saved to 
/var/cache/conftool/dbconfig/20220721-214532-ladsgroup.json [21:45:37] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:48:31] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (10Peachey88) [21:48:31] !log re-enabling puppet on parsoid (wtp*) [21:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:21] !log re-enabling puppet on mw2 in groups (codfw) [21:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:54] (03CR) 10Jdlrobson: [C: 03+1] "Since this was true on master in 1.39.0-wmf.20 and the code is removed in 1.39.0-wmf.21 I believe this is safe to merge at your earliest c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810405 (https://phabricator.wikimedia.org/T310527) (owner: 10Clare Ming) [22:00:35] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31676 and previous config saved to /var/cache/conftool/dbconfig/20220721-220038-ladsgroup.json [22:02:33] !log dancy@deploy1002 Installing scap version "4.11.3" for 559 hosts [22:02:58] !log dancy@deploy1002 Installation of scap version "4.11.3" completed for 559 hosts [22:05:05] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2045.codfw.wmnet with OS bullseye [22:05:12] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - 
https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye [22:09:24] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2045.codfw.wmnet with OS bullseye [22:09:29] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye executed with errors: - elastic... [22:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31677 and previous config saved to /var/cache/conftool/dbconfig/20220721-221543-ladsgroup.json [22:18:54] (03PS1) 10BCornwall: geodns: Move eqsin, drmrs and esams around in Asia [dns] - 10https://gerrit.wikimedia.org/r/816053 [22:22:48] brett: that sounds like a lot of work, I bet they're heavy [22:26:23] rzl: Made me laugh :D [22:26:53] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:26:59] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2045.codfw.wmnet with OS bullseye [22:27:06] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye [22:27:46] (thanks for the work though! always glad to see us get Faster) [22:28:03] mmandere did all the work :) [22:28:35] well, thanks both then! 
[22:28:42] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "[deploy1002:~] $ for tests in main redirects; do httpbb /srv/deployment/httpbb-tests/appserver/test_${tests}.yaml --hosts=mw[2251-2253,226" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [22:30:29] !log re-enabling puppet on all remaining 'C:profile::mediawiki::httpd' [22:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31678 and previous config saved to /var/cache/conftool/dbconfig/20220721-223048-ladsgroup.json [22:30:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:30:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:30:54] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [22:36:51] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:52:17] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2045.codfw.wmnet with reason: host reimage [22:55:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2045.codfw.wmnet with reason: host reimage [23:09:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:11:34] (03PS7) 10Mary Yang: DO-NOT-SUBMIT(Under review and discussion): Add puppet profile and role files for wikifunctions. 
[puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [23:12:24] (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under review and discussion): Add puppet profile and role files for wikifunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [23:12:45] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2045.codfw.wmnet with OS bullseye [23:12:52] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye completed: - elastic2045 (**PAS... [23:13:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:16:38] (03PS8) 10Mary Yang: DO-NOT-SUBMIT(Under review and discussion): [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [23:17:28] (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under review and discussion): [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [23:22:05] !log [cumin2002:~] $ sudo cumin 'C:profile::httpbb' "rm /srv/deployment/httpbb-tests/appserver/test_search.yaml" [23:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:12] (03PS9) 10Mary Yang: DO-NOT-SUBMIT(Under review and discussion): [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [23:24:32] (03CR) 10CI reject: [V: 04-1] DO-NOT-SUBMIT(Under review and discussion): [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) 
[23:27:57] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "deployed / tested on every single host https://wikitech.wikimedia.org/wiki/User:Dzahn/apache_testing" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [23:29:20] (03CR) 10Mary Yang: "Hi Daniel, does this look like one of the paths we discussed per https://phabricator.wikimedia.org/T311457? This is the "keep a separate p" [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [23:29:22] (03PS2) 10Dzahn: discovery: switchover doc to doc1002 [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [23:29:48] (03CR) 10Dzahn: "this is already done, therefore it rebases into nothing. sorry if I missed that" [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [23:31:17] (03PS1) 10Tim Starling: Increase core session expiry to 86400 to match CentralAuth [deployment-charts] - 10https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) [23:33:56] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) This has been deployed to all appservers and passes the tests in redirects and all other tests on all the hosts: ` [depl... 
[23:38:13] (03PS1) 10Tim Starling: Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) [23:38:15] (03PS1) 10Tim Starling: Move CentralAuth sessions to Kask [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496) [23:40:31] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) test from external: ` curl -H "Host: policy.wikimedia.org" https://dyna.wikimedia.org ..

The document has moved !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312863)', diff saved to https://phabricator.wikimedia.org/P31680 and previous config saved to /var/cache/conftool/dbconfig/20220721-234045-ladsgroup.json [23:40:51] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [23:43:14] (03CR) 10Dzahn: [C: 03+2] "curl -H "Host: policy.wikimedia.org" https://dyna.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/808309 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [23:45:15] (03PS3) 10Dzahn: switch policy.wikimedia.org back from Wordpress to WMF DNS [dns] - 10https://gerrit.wikimedia.org/r/808309 (https://phabricator.wikimedia.org/T310738) [23:53:04] !log https://policy.wikimedia.org moved from Wordpress DNS back to WMF DNS - now redirects to https://wikimediafoundation.org/advocacy/ as requested on T310738 | this might also resolve T132104 or not because wikimediafoundation.org is also on wordpress VIP [23:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:09] T132104: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 [23:53:10] T310738: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 [23:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P31681 and previous config saved to /var/cache/conftool/dbconfig/20220721-235551-ladsgroup.json [23:55:52] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[23:55:53] PROBLEM - Host policy.wikimedia.org is DOWN: /bin/ping -6 -n -U -w 10 -c 2 policy.wikimedia.org [23:57:13] RECOVERY - Host policy.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.13 ms [23:57:19] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) >>! In T310738#8070654, @Varnent wrote: > Just wanted to check on if there is anything else you are waiting from me on. I... [23:57:53] oh, it was a host in icinga, that I did not expect [23:58:09] well, that recovery makes sense, it points now to dyna.wikimedia [23:59:20] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) 05Open→03In progress p:05Medium→03High