[00:00:18] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[00:00:33] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1160.eqiad.wmnet with reason: host reimage
[00:02:38] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:02:57] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:03:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1160.eqiad.wmnet with reason: host reimage
[00:04:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:04:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1159.eqiad.wmnet with OS bullseye
[00:04:11] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1159.eqiad.wmnet with OS bullseye completed: - an-worker1159 (**WA...
[00:04:20] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul)
[00:11:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2035.codfw.wmnet with reason: host reimage
[00:11:27] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:12:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:12:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2034.codfw.wmnet with OS bullseye
[00:12:59] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2034.codfw.wmnet with OS bullseye completed: - restbase2034 (**PASS*...
[00:14:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2028.codfw.wmnet with OS bullseye
[00:14:16] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2028.codfw.wmnet with OS bullseye
[00:14:45] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul)
[00:14:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2035.codfw.wmnet with reason: host reimage
[00:24:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:25:20] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:27:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:27:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1160.eqiad.wmnet with OS bullseye
[00:27:55] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye completed: - an-worker1160 (**WA...
[00:31:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:33:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2028.codfw.wmnet with reason: host reimage
[00:35:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:35:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2035.codfw.wmnet with OS bullseye
[00:35:16] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2035.codfw.wmnet with OS bullseye completed: - restbase2035 (**PASS*...
[00:37:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2028.codfw.wmnet with reason: host reimage
[00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978166
[00:38:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978166 (owner: TrainBranchBot)
[00:42:10] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:08] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul)
[00:53:03] (PS1) Papaul: Add new logging-hd nodes to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/978156 (https://phabricator.wikimedia.org/T349834)
[01:08:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1161.eqiad.wmnet with OS bullseye
[01:08:43] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[01:38:24] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1162.eqiad.wmnet with OS bullseye
[01:38:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1163.eqiad.wmnet with OS bullseye
[01:38:30] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye
[01:38:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1161.eqiad.wmnet with OS bullseye
[01:38:36] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1163.eqiad.wmnet with OS bullseye
[01:38:40] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye
[01:40:20] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1165.eqiad.wmnet with OS bullseye
[01:40:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1164.eqiad.wmnet with OS bullseye
[01:40:26] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1165.eqiad.wmnet with OS bullseye
[01:40:30] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye
[01:41:31] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1166.eqiad.wmnet with OS bullseye
[01:41:37] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1166.eqiad.wmnet with OS bullseye
[01:42:06] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1167.eqiad.wmnet with OS bullseye
[01:42:13] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1167.eqiad.wmnet with OS bullseye
[01:43:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1169.eqiad.wmnet with OS bullseye
[01:43:13] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye
[01:43:18] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1169.eqiad.wmnet with OS bullseye
[01:43:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1170.eqiad.wmnet with OS bullseye
[01:43:55] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1170.eqiad.wmnet with OS bullseye
[01:45:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1171.eqiad.wmnet with OS bullseye
[01:45:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye
[01:45:13] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye
[01:45:19] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1173.eqiad.wmnet with OS bullseye
[01:45:20] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye
[01:45:27] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1173.eqiad.wmnet with OS bullseye
[01:46:18] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1174.eqiad.wmnet with OS bullseye
[01:46:22] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1175.eqiad.wmnet with OS bullseye
[01:46:25] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1174.eqiad.wmnet with OS bullseye
[01:46:29] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye
[01:52:36] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1163.eqiad.wmnet with reason: host reimage
[01:54:04] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1165.eqiad.wmnet with reason: host reimage
[01:54:25] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1164.eqiad.wmnet with reason: host reimage
[01:55:33] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1166.eqiad.wmnet with reason: host reimage
[01:55:44] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1167.eqiad.wmnet with reason: host reimage
[01:55:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1163.eqiad.wmnet with reason: host reimage
[01:57:16] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1169.eqiad.wmnet with reason: host reimage
[01:58:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1166.eqiad.wmnet with reason: host reimage
[01:58:52] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1173.eqiad.wmnet with reason: host reimage
[01:59:55] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1174.eqiad.wmnet with reason: host reimage
[01:59:58] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1175.eqiad.wmnet with reason: host reimage
[02:00:11] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1168.eqiad.wmnet with reason: host reimage
[02:00:41] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1170.eqiad.wmnet with reason: host reimage
[02:01:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1169.eqiad.wmnet with reason: host reimage
[02:03:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1165.eqiad.wmnet with reason: host reimage
[02:03:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1164.eqiad.wmnet with reason: host reimage
[02:04:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1173.eqiad.wmnet with reason: host reimage
[02:04:48] (CR) Krinkle: [C: +1] Add virtual domain for botpasswords [mediawiki-config] - https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) (owner: Ladsgroup)
[02:05:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1167.eqiad.wmnet with reason: host reimage
[02:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[02:09:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1175.eqiad.wmnet with reason: host reimage
[02:09:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1174.eqiad.wmnet with reason: host reimage
[02:10:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1168.eqiad.wmnet with reason: host reimage
[02:10:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1170.eqiad.wmnet with reason: host reimage
[02:12:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1164.eqiad.wmnet with OS bullseye
[02:12:42] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[02:15:59] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:17:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:17:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1163.eqiad.wmnet with OS bullseye
[02:17:17] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1163.eqiad.wmnet with OS bullseye completed: - an-worker1163 (**WA...
[02:18:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1168.eqiad.wmnet with OS bullseye
[02:18:24] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[02:20:23] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:21:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:21:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1166.eqiad.wmnet with OS bullseye
[02:21:35] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1166.eqiad.wmnet with OS bullseye completed: - an-worker1166 (**WA...
[02:23:15] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:24:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:24:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1169.eqiad.wmnet with OS bullseye
[02:24:29] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:24:33] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1169.eqiad.wmnet with OS bullseye completed: - an-worker1169 (**WA...
[02:25:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:26:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1165.eqiad.wmnet with OS bullseye
[02:26:05] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:26:06] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1165.eqiad.wmnet with OS bullseye completed: - an-worker1165 (**WA...
[02:27:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:27:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1175.eqiad.wmnet with OS bullseye
[02:27:22] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye completed: - an-worker1175 (**PA...
[02:27:34] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:28:10] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1164.eqiad.wmnet with OS bullseye
[02:28:16] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye
[02:28:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1168.eqiad.wmnet with OS bullseye
[02:28:23] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye
[02:28:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:28:39] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:28:46] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1173.eqiad.wmnet with OS bullseye
[02:28:52] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1173.eqiad.wmnet with OS bullseye completed: - an-worker1173 (**WA...
[02:30:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:30:23] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:30:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1167.eqiad.wmnet with OS bullseye
[02:30:35] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1167.eqiad.wmnet with OS bullseye completed: - an-worker1167 (**WA...
[02:31:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:31:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1174.eqiad.wmnet with OS bullseye
[02:31:40] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1174.eqiad.wmnet with OS bullseye completed: - an-worker1174 (**WA...
[02:32:49] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:33:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[02:33:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1170.eqiad.wmnet with OS bullseye
[02:34:03] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1170.eqiad.wmnet with OS bullseye completed: - an-worker1170 (**WA...
[02:39:00] (JobUnavailable) firing: (6) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:42:48] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1168.eqiad.wmnet with reason: host reimage
[02:43:08] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1164.eqiad.wmnet with reason: host reimage
[02:45:00] (CR) Krinkle: [C: -1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[02:45:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1168.eqiad.wmnet with reason: host reimage
[02:47:39] PROBLEM - Dell PowerEdge RAID Controller on db1199 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[02:47:40] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1199 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T352238 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[02:47:44] SRE, ops-eqiad: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (ops-monitoring-bot)
[02:48:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1164.eqiad.wmnet with reason: host reimage
[02:55:34] (CR) Gergő Tisza: "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T2100" [mediawiki-config] - https://gerrit.wikimedia.org/r/977791 (owner: Gergő Tisza)
[02:58:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1162.eqiad.wmnet with OS bullseye
[02:58:45] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[02:58:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1161.eqiad.wmnet with OS bullseye
[02:58:53] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[03:05:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1171.eqiad.wmnet with OS bullseye
[03:05:25] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[03:05:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye
[03:05:33] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[03:06:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[03:08:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[03:08:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1168.eqiad.wmnet with OS bullseye
[03:08:38] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1168.eqiad.wmnet with OS bullseye completed: - an-worker1168 (**WA...
[03:09:00] (JobUnavailable) firing: (6) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:09] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[03:11:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:11:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[03:11:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1164.eqiad.wmnet with OS bullseye
[03:11:40] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1164.eqiad.wmnet with OS bullseye completed: - an-worker1164 (**WA...
[03:12:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1161.eqiad.wmnet with OS bullseye [03:12:54] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1162.eqiad.wmnet with OS bullseye [03:13:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [03:13:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1171.eqiad.wmnet with OS bullseye [03:13:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye [03:13:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye [03:13:22] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [03:13:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye [03:48:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [03:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [03:58:13] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:01:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:14:27] (03PS10) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) [04:24:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:27:01] (03PS1) 10KartikMistry: Update Apertium to 2023-11-29-041830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060) [04:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:32:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1161.eqiad.wmnet with OS bullseye [04:33:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with 
errors: - an-worke... [04:33:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1162.eqiad.wmnet with OS bullseye [04:33:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [04:33:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1171.eqiad.wmnet with OS bullseye [04:33:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [04:33:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye [04:33:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye executed with errors: - an-worke... 
[04:52:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:53:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:13:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 44.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:18:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 44.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:23:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 49.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:28:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 46.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:01:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:01:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [06:09:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:10:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:10:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:10:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T0700) [07:02:59] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:03:33] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (10Marostegui) Disk #8 is broken ` TOPOLOGY : ======== ---------------------------------------------------------------------------- DG Arr Row 
EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR -----... [07:09:57] (JobUnavailable) firing: (5) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:12:52] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978365 (https://phabricator.wikimedia.org/T351620) [07:13:46] (03PS1) 10Marostegui: pc1012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978366 (https://phabricator.wikimedia.org/T351620) [07:13:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2012.codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Switch [07:13:58] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978365 (https://phabricator.wikimedia.org/T351620) (owner: 10Marostegui) [07:14:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2012.codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Switch [07:14:34] (03CR) 10Marostegui: [C: 03+2] pc1012: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978366 (https://phabricator.wikimedia.org/T351620) (owner: 10Marostegui) [07:14:43] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978365 (https://phabricator.wikimedia.org/T351620) (owner: 10Marostegui) [07:15:52] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:978365|ProductionServices.php: Promote pc1014 to pc2 (T351620)]] [07:15:58] T351620: Upgrade pc2 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351620 [07:17:17] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:978365|ProductionServices.php: Promote 
pc1014 to pc2 (T351620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:18:40] !log marostegui@deploy2002 marostegui: Continuing with sync [07:20:48] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978074 [07:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1027 T351916', diff saved to https://phabricator.wikimedia.org/P53931 and previous config saved to /var/cache/conftool/dbconfig/20231129-072306-root.json [07:23:12] T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916 [07:23:40] (03PS1) 10Marostegui: es1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978368 (https://phabricator.wikimedia.org/T351916) [07:24:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1027.eqiad.wmnet with OS bookworm [07:24:33] (03CR) 10Marostegui: [C: 03+2] es1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978368 (https://phabricator.wikimedia.org/T351916) (owner: 10Marostegui) [07:25:18] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:978365|ProductionServices.php: Promote pc1014 to pc2 (T351620)]] (duration: 09m 25s) [07:25:23] T351620: Upgrade pc2 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351620 [07:26:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1012.eqiad.wmnet with OS bookworm [07:32:12] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10KartikMistry) @elukey, @akosiaris What can be the next step for this? 
[07:37:12] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:37:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on es1027.eqiad.wmnet with reason: host reimage [07:38:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1012.eqiad.wmnet with reason: host reimage [07:40:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1027.eqiad.wmnet with reason: host reimage [07:43:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1012.eqiad.wmnet with reason: host reimage [07:45:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I'll abandon the original conversion patch" [puppet] - 10https://gerrit.wikimedia.org/r/978154 (owner: 10Andrew Bogott) [07:45:21] (03Abandoned) 10Muehlenhoff: openstack::base::wikitech::web: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/977170 (owner: 10Muehlenhoff) [07:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:50:02] (03PS1) 10Marostegui: Revert "es1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978075 [07:55:47] (03PS1) 10Marostegui: Revert "pc1012: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978076 [07:56:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1027.eqiad.wmnet with OS bookworm [07:56:22] (03CR) 10Marostegui: [C: 03+2] Revert "es1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978075 (owner: 10Marostegui) [07:57:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53932 and 
previous config saved to /var/cache/conftool/dbconfig/20231129-075738-root.json [07:58:48] (03PS1) 10EoghanGaffney: [admin] Add toyofuku to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/978463 (https://phabricator.wikimedia.org/T351857) [08:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T0800). [08:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:36] o/ [08:01:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1012.eqiad.wmnet with OS bookworm [08:03:59] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:09] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:05:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:07] Given that the deployment didn't start yet, I am going to quickly deploy a revert for the pc2 switchover [08:10:13] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978074 (owner: 10Marostegui) [08:10:33] (03CR) 10Marostegui: [C: 03+2] Revert "pc1012: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978076 (owner: 10Marostegui) [08:10:35] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/978463 
(https://phabricator.wikimedia.org/T351857) (owner: 10EoghanGaffney) [08:10:54] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978074 (owner: 10Marostegui) [08:11:21] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:978074|Revert "ProductionServices.php: Promote pc1014 to pc2"]] [08:12:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53933 and previous config saved to /var/cache/conftool/dbconfig/20231129-081243-root.json [08:12:45] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:978074|Revert "ProductionServices.php: Promote pc1014 to pc2"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:13:03] !log marostegui@deploy2002 marostegui: Continuing with sync [08:18:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 44.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:19:22] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:978074|Revert "ProductionServices.php: Promote pc1014 to pc2"]] (duration: 08m 01s) [08:21:53] (03PS1) 10Muehlenhoff: archiva: Update outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/978466 [08:22:43] !log oathauth_users from private.dblist T348693 [08:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:48] T348693: Drop oathauth_users table from production - https://phabricator.wikimedia.org/T348693 [08:23:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 46.34% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:24:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:25:51] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53934 and previous config saved to /var/cache/conftool/dbconfig/20231129-082748-root.json [08:28:28] !log oathauth_users from fishbowl.dblist T348693 [08:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:36] T348693: Drop oathauth_users table from production - https://phabricator.wikimedia.org/T348693 [08:30:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:46] (03CR) 10Ayounsi: [C: 03+1] P:trafficserver::backend: netbox-next switch to netbox-next.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/797320 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [08:33:06] jouncebot: nowandnext [08:33:06] For the next 0 hour(s) and 26 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T0800) [08:33:07] In 0 hour(s) and 26 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T0900) 
[08:33:31] !log Drop oathauth_users from centralauth T348693 [08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:37] T348693: Drop oathauth_users table from production - https://phabricator.wikimedia.org/T348693 [08:34:08] (03CR) 10Marostegui: [C: 03+2] oathauth_users: Prepare for removal [puppet] - 10https://gerrit.wikimedia.org/r/977123 (https://phabricator.wikimedia.org/T348693) (owner: 10Marostegui) [08:34:08] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:09] aanzx: hello, looks like nobody was around for the deployment window :-\ [08:34:49] and both patches are apparently in merge conflict bahhh [08:35:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:44] (03PS8) 10Anzx: zghwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) [08:35:58] (03PS4) 10Hashar: Enable VisualEditor in the Appendix namespace on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978023 (https://phabricator.wikimedia.org/T350926) (owner: 10Anzx) [08:36:25] aanzx: I am doing the one for "Enable VisualEditor in the Appendix namespace on enwiktionary" [08:36:31] ok [08:36:37] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (10wiki_willy) ++ @Jclark-ctr & @VRiley-WMF - can one of you two work on getting the drive RMA'd for this one? 
Thanks, Willy [08:36:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978023 (https://phabricator.wikimedia.org/T350926) (owner: 10Anzx) [08:37:17] then I don't know what is really needed to turn on the VisualEditor, hopefully changing that flag is enough [08:37:49] (03Merged) 10jenkins-bot: Enable VisualEditor in the Appendix namespace on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978023 (https://phabricator.wikimedia.org/T350926) (owner: 10Anzx) [08:38:14] !log hashar@deploy2002 Started scap: Backport for [[gerrit:978023|Enable VisualEditor in the Appendix namespace on enwiktionary (T350926)]] [08:38:20] T350926: Enable VisualEditor in the Appendix namespace on English Wiktionary - https://phabricator.wikimedia.org/T350926 [08:39:41] !log hashar@deploy2002 hashar and anzx: Backport for [[gerrit:978023|Enable VisualEditor in the Appendix namespace on enwiktionary (T350926)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:40:03] hashar: testing [08:40:30] (03CR) 10Majavah: [C: 03+2] wikitech: remove port 80 ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/978154 (owner: 10Andrew Bogott) [08:41:38] aanzx: looks like it works for me now [08:41:44] hashar: looks good [08:41:50] !log hashar@deploy2002 hashar and anzx: Continuing with sync [08:41:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:42:37] for the logo change ( https://en.wiktionary.org/wiki/Appendix:French_verbs ) I haven't done that in like 10 years or so [08:42:40] I gotta read the doc [08:42:41] :D [08:42:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53935 and previous config saved to 
/var/cache/conftool/dbconfig/20231129-084253-root.json [08:43:30] (03PS1) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/978468 (https://phabricator.wikimedia.org/T351864) [08:46:59] the other change is still deploying [08:47:17] it started to be rather slow since a few months ago :-\ [08:48:16] 10sre-alert-triage, 10cloud-services-team: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342757 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving as the alert is no longer active. [08:48:25] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:978023|Enable VisualEditor in the Appendix namespace on enwiktionary (T350926)]] (duration: 10m 10s) [08:48:30] T350926: Enable VisualEditor in the Appendix namespace on English Wiktionary - https://phabricator.wikimedia.org/T350926 [08:48:41] done [08:48:47] (03CR) 10Hashar: [C: 03+2] zghwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [08:48:54] aanzx: I am doing the other change :) [08:49:20] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:33] Ok [08:49:56] (03Merged) 10jenkins-bot: zghwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [08:50:20] (03CR) 10Majavah: [C: 03+2] team-wmcs: Merge systemd ForLong alert to the main one [alerts] - 10https://gerrit.wikimedia.org/r/977742 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [08:50:21] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue alert [warning] puppet fails on idp-test1002 - https://phabricator.wikimedia.org/T343898 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving as the alert is no longer active. 
[08:50:38] (03CR) 10Majavah: [C: 03+2] team-wmcs: improve host down alerts [alerts] - 10https://gerrit.wikimedia.org/r/977743 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [08:50:42] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:45] (03CR) 10MVernon: [C: 03+1] "c3abe8cd63928873efb232e265ddc24c hieradata/hosts/dbproxy1022.yaml" [dns] - 10https://gerrit.wikimedia.org/r/978468 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [08:52:01] (03Merged) 10jenkins-bot: team-wmcs: Merge systemd ForLong alert to the main one [alerts] - 10https://gerrit.wikimedia.org/r/977742 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [08:52:31] (03Merged) 10jenkins-bot: team-wmcs: improve host down alerts [alerts] - 10https://gerrit.wikimedia.org/r/977743 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [08:52:34] !log hashar@deploy2002 Started scap: Backport for [[gerrit:975379|zghwiki: add logos (T350241)]] [08:52:40] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [08:52:46] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:53:04] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/978468 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [08:53:55] !log hashar@deploy2002 hashar and anzx: Backport for [[gerrit:975379|zghwiki: add logos (T350241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:53:56] !log Failover m1-master from 
dbproxy1022 to dbproxy1024 T351864 [08:53:59] hashar: testing [08:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:05] T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864 [08:54:28] aanzx: maybe some logo are cached and need to be purged [08:55:15] hashar: new logo appears for me [08:55:24] awesome! [08:55:27] !log hashar@deploy2002 hashar and anzx: Continuing with sync [08:57:36] (03PS1) 10Vgutierrez: lvs::balancer: Limit ipip-mq-optimizer prometheus endpoint reachability [puppet] - 10https://gerrit.wikimedia.org/r/978470 [08:57:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53936 and previous config saved to /var/cache/conftool/dbconfig/20231129-085758-root.json [08:58:46] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:12] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:29] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/752/con" [puppet] - 10https://gerrit.wikimedia.org/r/978470 (owner: 10Vgutierrez) [09:00:06] hashar and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T0900). 
[09:00:27] * hashar whistles [09:00:36] I am tired of that bot humor really [09:00:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:13] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:975379|zghwiki: add logos (T350241)]] (duration: 09m 39s) [09:02:20] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [09:02:28] Thanks hashar [09:02:34] hurrah :) [09:02:43] aanzx: thank you for the config patches [09:03:55] (03PS1) 10Marostegui: dbproxy1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978471 (https://phabricator.wikimedia.org/T351864) [09:04:01] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978472 (https://phabricator.wikimedia.org/T350083) [09:04:03] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978472 (https://phabricator.wikimedia.org/T350083) (owner: 10TrainBranchBot) [09:04:13] (03CR) 10Ayounsi: [C: 03+1] "lgtm (and PCC output looks good)" [puppet] - 10https://gerrit.wikimedia.org/r/978470 (owner: 10Vgutierrez) [09:04:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1025.eqiad.wmnet with OS bookworm [09:04:41] (03CR) 10Marostegui: [C: 03+2] dbproxy1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978471 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [09:04:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] lvs::balancer: Limit ipip-mq-optimizer prometheus endpoint reachability [puppet] - 10https://gerrit.wikimedia.org/r/978470 (owner: 10Vgutierrez) [09:04:54] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.7 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/978472 (https://phabricator.wikimedia.org/T350083) (owner: 10TrainBranchBot) [09:05:22] marostegui: already merging yours or got a prompt with mine too? [09:05:29] vgutierrez: No, it is mergning [09:05:33] Merging [09:05:33] ack [09:05:40] vgutierrez: done [09:05:51] thx <3 [09:08:36] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:10:30] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 43.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:11:22] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:12:08] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:13:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53937 and previous config saved to /var/cache/conftool/dbconfig/20231129-091303-root.json [09:13:16] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.7 refs T350083 [09:13:21] T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083 [09:13:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) 
[09:15:42] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 43.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:18:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:18:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage [09:20:40] !log hashar@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.7 refs T350083 (duration: 07m 23s) [09:20:45] T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083 [09:21:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage [09:23:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:23:59] my theory is that developers avoid merging code that I will 
have to deploy and postpone it to the next train [09:24:17] no errors reported so far, I will continue monitoring [09:28:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53938 and previous config saved to /var/cache/conftool/dbconfig/20231129-092808-root.json [09:32:58] (03Abandoned) 10Jbond: wmflib: add ord function [puppet] - 10https://gerrit.wikimedia.org/r/736212 (owner: 10Jbond) [09:33:34] (03CR) 10Jbond: [C: 03+2] admin::get_users: add function to get a list of configured users [puppet] - 10https://gerrit.wikimedia.org/r/690366 (owner: 10Jbond) [09:34:46] (03PS1) 10Majavah: hieradata: fix cluster assignment for many WMCS nodes [puppet] - 10https://gerrit.wikimedia.org/r/978475 [09:34:48] (03PS1) 10Majavah: P:wmcs: disable systemd icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/978476 (https://phabricator.wikimedia.org/T345294) [09:34:50] (03PS1) 10Majavah: openstack: remove redundant monitoring_enabled => false settings [puppet] - 10https://gerrit.wikimedia.org/r/978477 [09:35:29] (03CR) 10Jbond: [C: 03+2] P:httpbb: check for basicauth_credentials using defined [puppet] - 10https://gerrit.wikimedia.org/r/734999 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [09:36:22] (03CR) 10Jbond: [C: 03+2] R:icingamonitor::elasticsearch::cirrus_settings_check [puppet] - 10https://gerrit.wikimedia.org/r/735015 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [09:36:48] (03CR) 10Jbond: [C: 03+2] firmware fact: drop firmware_bios [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [09:38:38] (03CR) 10Muehlenhoff: P:gerrit: Add logoutd script for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [09:40:26] (03CR) 10Jbond: puppetserver: '/srv/puppet_code/environments' owned by puppet/puppet (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: 10Andrew Bogott) [09:40:48] (03PS1) 10Slyngshede: C:netbox switch Netbox-Next to use plain OIDC [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) [09:42:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1025.eqiad.wmnet with OS bookworm [09:42:34] (03CR) 10Jbond: [C: 03+2] P:trafficserver::backend: netbox-next switch to netbox-next.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/797320 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [09:44:57] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/753/con" [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [09:45:28] (03CR) 10Jbond: [C: 03+1] "LGTM but having everything as wmcs may be a bit generic" [puppet] - 10https://gerrit.wikimedia.org/r/978475 (owner: 10Majavah) [09:45:36] (03CR) 10D3r1ck01: "Ack!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [09:45:46] (03PS7) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [09:46:06] (03PS8) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [09:48:49] (03CR) 10JMeybohm: [C: 03+1] k8s: allow setting prometheus retention in cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [09:48:57] (03CR) 10JMeybohm: [C: 03+1] hieradata: set 850GB retention for prometheus@k8s [puppet] - 10https://gerrit.wikimedia.org/r/977688 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [09:57:28] (03PS1) 10Clément Goubert: mw-on-k8s: lower idle php-fpm requirement for mw-api-int [alerts] - 10https://gerrit.wikimedia.org/r/978481 [09:59:25] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10BTullis) p:05Triage→03High [10:02:09] (03PS1) 10Vgutierrez: ncredir: Fix ncredir_ipip ferm rule saddr [puppet] - 10https://gerrit.wikimedia.org/r/978482 [10:02:32] (03PS6) 10Muehlenhoff: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [10:02:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] changejob-jobqueue: move two more jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/978032 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:03:18] (03CR) 10Muehlenhoff: P:gerrit: Add logoutd script for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705426 
(https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [10:03:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/978482 (owner: 10Vgutierrez) [10:05:32] (03CR) 10CI reject: [V: 04-1] P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [10:06:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [10:06:03] (03CR) 10Majavah: [C: 03+2] hieradata: fix cluster assignment for many WMCS nodes [puppet] - 10https://gerrit.wikimedia.org/r/978475 (owner: 10Majavah) [10:08:34] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Fix ncredir_ipip ferm rule saddr [puppet] - 10https://gerrit.wikimedia.org/r/978482 (owner: 10Vgutierrez) [10:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [10:09:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, but I suggest to lower the threshold to 0.3 for mw-api-int" [alerts] - 10https://gerrit.wikimedia.org/r/978481 (owner: 10Clément Goubert) [10:09:59] (03PS1) 10Majavah: hieradata: WMCS servers exist in codfw [puppet] - 10https://gerrit.wikimedia.org/r/978483 [10:10:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:10:23] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:10:51] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[10:11:16] (03CR) 10Majavah: [C: 03+2] hieradata: WMCS servers exist in codfw [puppet] - 10https://gerrit.wikimedia.org/r/978483 (owner: 10Majavah) [10:13:27] (03CR) 10Muehlenhoff: Revert "hieradata: delete puppet7 hiera keys for planet hosts" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [10:13:50] (03PS1) 10Vgutierrez: realserver::ipip,ncredir: Move IP[6]IP[6] ferm rules to ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/978484 [10:14:13] (03PS2) 10Vgutierrez: realserver::ipip,ncredir: Move IP[6]IP[6] ferm rules to ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/978484 [10:15:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:16:54] (03CR) 10Majavah: "I fixed (and you merged) a different fix in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/80cda0b7e127d2326d31af63cfb" [puppet] - 10https://gerrit.wikimedia.org/r/734999 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [10:18:03] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:18:28] (03PS2) 10Clément Goubert: mw-on-k8s: lower idle php-fpm requirement for mw-api-int [alerts] - 10https://gerrit.wikimedia.org/r/978481 [10:18:46] (03PS3) 10Fabfur: decom cp1075-1090 [puppet] - 10https://gerrit.wikimedia.org/r/977702 (https://phabricator.wikimedia.org/T349244) [10:20:24] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: lower idle php-fpm requirement for mw-api-int [alerts] - 10https://gerrit.wikimedia.org/r/978481 (owner: 10Clément Goubert) [10:20:34] (03CR) 10Clément Goubert: [C: 
03+2] mw-on-k8s: lower idle php-fpm requirement for mw-api-int (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/978481 (owner: 10Clément Goubert) [10:21:05] (03PS1) 10Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978168 (https://phabricator.wikimedia.org/T343123) [10:21:33] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/978484 (owner: 10Vgutierrez) [10:21:43] (03Merged) 10jenkins-bot: mw-on-k8s: lower idle php-fpm requirement for mw-api-int [alerts] - 10https://gerrit.wikimedia.org/r/978481 (owner: 10Clément Goubert) [10:21:52] (03CR) 10Fabfur: [C: 03+2] decom cp1075-1090 [puppet] - 10https://gerrit.wikimedia.org/r/977702 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [10:22:10] (03PS1) 10Slyngshede: RAID - Add instance name to MD RAID alert summary [alerts] - 10https://gerrit.wikimedia.org/r/978485 [10:22:50] (03CR) 10Majavah: [V: 03+1 C: 03+1] "Might want to add an $ensure to the rules, otherwise looks good" [puppet] - 10https://gerrit.wikimedia.org/r/978484 (owner: 10Vgutierrez) [10:23:37] (03CR) 10Muehlenhoff: [C: 03+2] karapace: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970732 (owner: 10Muehlenhoff) [10:26:17] (03PS3) 10Vgutierrez: realserver::ipip,ncredir: Move IP[6]IP[6] ferm rules to ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/978484 [10:26:38] (03CR) 10Vgutierrez: realserver::ipip,ncredir: Move IP[6]IP[6] ferm rules to ipip profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978484 (owner: 10Vgutierrez) [10:30:23] "Transaction round stage must be 'cursory' (not 'within-commit-callbacks')" [10:30:31] I love those cryptic messages :D [10:31:04] (03CR) 10Vgutierrez: [C: 03+2] realserver::ipip,ncredir: Move 
IP[6]IP[6] ferm rules to ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/978484 (owner: 10Vgutierrez) [10:33:56] (03PS1) 10Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978079 [10:35:05] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978168 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [10:35:41] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978168 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [10:36:32] !log decommissioning cp1075-1090 (T352253) [10:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:39] T352253: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 [10:36:48] (03Merged) 10jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978168 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [10:37:11] !log pausing all active dags on all airflow instances [10:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:52] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp[1075-1090].eqiad.wmnet [10:39:02] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on the analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/977632 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [10:40:24] (03CR) 10Volans: "I see that there are some weird users, with last login today:" [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [10:40:26] (03CR) 10Majavah: [C: 04-1] scap3: stop defaulting 
deployment_group to 'wikidev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [10:40:32] (03PS1) 10Jbond: Revert "P:httpbb: check for basicauth_credentials using defined" [puppet] - 10https://gerrit.wikimedia.org/r/978080 [10:41:20] (03CR) 10Majavah: [C: 03+1] Revert "P:httpbb: check for basicauth_credentials using defined" [puppet] - 10https://gerrit.wikimedia.org/r/978080 (owner: 10Jbond) [10:41:36] (03CR) 10Jbond: [C: 03+2] Revert "P:httpbb: check for basicauth_credentials using defined" [puppet] - 10https://gerrit.wikimedia.org/r/978080 (owner: 10Jbond) [10:41:54] (03PS1) 10Ladsgroup: beta: Enable temp users in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978488 [10:44:14] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978079 (owner: 10Marostegui) [10:44:29] PROBLEM - Check systemd state on mw1474 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:43] (03CR) 10Hnowlan: [C: 03+2] changejob-jobqueue: move two more jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/978032 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:46:29] (03Merged) 10jenkins-bot: changejob-jobqueue: move two more jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/978032 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:48:12] (03CR) 10Klausman: [C: 03+1] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978168 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [10:48:32] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/978127 (owner: 10Cathal Mooney) [10:48:58] (03CR) 10EoghanGaffney: [C: 03+2] [admin] Add toyofuku to ldap 
only users [puppet] - 10https://gerrit.wikimedia.org/r/978463 (https://phabricator.wikimedia.org/T351857) (owner: 10EoghanGaffney) [10:49:03] (03CR) 10Ladsgroup: [C: 03+2] beta: Enable temp users in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978488 (owner: 10Ladsgroup) [10:49:18] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:49:56] (03Merged) 10jenkins-bot: beta: Enable temp users in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978488 (owner: 10Ladsgroup) [10:50:41] !log klausman@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:51:09] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:51:25] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:53:40] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] base::sysctl: Allow disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/978088 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [10:53:46] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:55:40] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp filter on ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [10:56:16] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [10:56:48] (03PS1) 10Fabfur: hiera: consolidate new cp hosts backend for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/978490 (https://phabricator.wikimedia.org/T352078) [10:56:53] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [10:57:16] !log upload ipip-multiqueue-optimizer 0.3+deb11u1 to apt.wm.o (bullseye) - T352249 [10:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:21] T352249: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 [10:59:52] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on the search instance [puppet] - 10https://gerrit.wikimedia.org/r/977633 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [11:00:02] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/756/console" [puppet] - 10https://gerrit.wikimedia.org/r/978490 (https://phabricator.wikimedia.org/T352078) (owner: 10Fabfur) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1100) [11:00:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] hiera: consolidate new cp hosts backend for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/978490 (https://phabricator.wikimedia.org/T352078) (owner: 10Fabfur) [11:00:41] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/755/console" [puppet] - 10https://gerrit.wikimedia.org/r/978490 
(https://phabricator.wikimedia.org/T352078) (owner: 10Fabfur) [11:00:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5 CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [11:01:52] (03CR) 10Jbond: "hi riccardo i think this is ready to merge as is but perhaps just got missed can you take a new ass" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [11:02:10] (03Abandoned) 10Jbond: Revert "netbox: add hostname to allowed list of hostnames" [puppet] - 10https://gerrit.wikimedia.org/r/806253 (owner: 10Jbond) [11:02:38] (03PS2) 10Jbond: netbox: update netbox service definition so it pages [puppet] - 10https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) [11:03:08] (03CR) 10Jbond: "let me know what you think on about this" [puppet] - 10https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:04:16] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on the research instance [puppet] - 10https://gerrit.wikimedia.org/r/977634 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [11:04:22] (03PS1) 10Muehlenhoff: ganeti: Switch ulsfo to PKI [puppet] - 10https://gerrit.wikimedia.org/r/978491 (https://phabricator.wikimedia.org/T350686) [11:04:27] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on the platform_eng instance [puppet] - 10https://gerrit.wikimedia.org/r/977635 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [11:04:38] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on the analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/977636 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [11:05:35] (03PS2) 10Slyngshede: Keymanagement: SSH keys are in some cases not synced to LDAP. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) [11:07:09] (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: consolidate new cp hosts backend for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/978490 (https://phabricator.wikimedia.org/T352078) (owner: 10Fabfur) [11:09:57] (JobUnavailable) firing: (5) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:43] (03PS17) 10Jbond: puppet: add wrapper command [puppet] - 10https://gerrit.wikimedia.org/r/808877 [11:11:11] (03CR) 10CI reject: [V: 04-1] puppet: add wrapper command [puppet] - 10https://gerrit.wikimedia.org/r/808877 (owner: 10Jbond) [11:12:00] (03PS2) 10KartikMistry: Update cxserver to 2023-11-28-064518-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) [11:12:25] !log re-enabled all DAGs on all airflow instances after airflow upgrade to 2.7.3 [11:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:04] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on wmde [puppet] - 10https://gerrit.wikimedia.org/r/977637 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [11:13:12] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:13:35] (03Abandoned) 10Jbond: puppet: add wrapper command [puppet] - 10https://gerrit.wikimedia.org/r/808877 (owner: 10Jbond) [11:13:37] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:14:08] (03Abandoned) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond) [11:14:35] (03CR) 10Jbond: "@riccardo is this of interest or should i abandon" [software/spicerack] - 
10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [11:14:53] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:15:19] (03CR) 10Jbond: "is this of interest, if not ill abandon" [puppet] - 10https://gerrit.wikimedia.org/r/815761 (owner: 10Jbond) [11:15:19] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:15:47] hnowlan: Can you please review: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/977983 (cxserver: Use MediaWiki REST API endpoint instead of RESTbase) [11:15:56] (03Abandoned) 10Jbond: P:apt: apply apt before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819581 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [11:18:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 41.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:18:41] (03CR) 10Volans: remote: add an __iter__ to RemoteHosts (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [11:18:56] (03PS2) 10Jbond: P:redis::slave: pass the password [puppet] - 10https://gerrit.wikimedia.org/r/823181 (https://phabricator.wikimedia.org/T228266) [11:19:56] (03Abandoned) 10Jbond: P:redis::slave: drop use of inline_template [puppet] - 10https://gerrit.wikimedia.org/r/823182 (owner: 10Jbond) [11:19:59] (03CR) 10Ladsgroup: "if you review the core patch too, that'd be amazing so I can deploy this 😄" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) (owner: 10Ladsgroup) [11:20:29] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: sre.hardware.upgrade-firmware cookbook: product slug parsing - 
https://phabricator.wikimedia.org/T348036 (10Volans) p:05Triage→03Low Given no objections I went ahead and fixed ALL names and slug to adhere to the standard. Triaging as low and leaving the task... [11:20:57] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netbox: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Volans) [11:21:24] !log upload tcp-mss-clamper 0.3+deb12u1 to apt.wm.o (bookworm) - T352249 [11:21:24] (03PS3) 10Jbond: P:redis::slave: pass the password [puppet] - 10https://gerrit.wikimedia.org/r/823181 (https://phabricator.wikimedia.org/T228266) [11:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:29] T352249: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 [11:21:36] 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (10Fabfur) [11:22:38] (03CR) 10Jbond: "let me know if this is of interest if not ill abandon" [puppet] - 10https://gerrit.wikimedia.org/r/823181 (https://phabricator.wikimedia.org/T228266) (owner: 10Jbond) [11:22:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1474 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:23:06] (03Abandoned) 10Jbond: deployment-prep: use pki for etcd certificates in deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/824158 (https://phabricator.wikimedia.org/T315395) (owner: 10Jbond) [11:23:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:23:17] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [11:23:22] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for SToyofuku-WMF - https://phabricator.wikimedia.org/T351857 (10eoghan) 05Open→03Resolved a:03eoghan I've granted the access in LDAP and added to the phabricator WMF group. @SToyofuku-WMF, please reopen and let us know if there's anything not wor... [11:25:03] (03CR) 10Jbond: [V: 03+1] "let me know if this is desirable if not ill abandon" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [11:25:12] (03PS1) 10Clément Goubert: mw-on-k8s: lower idle php-fpm requirement for jobrunner [alerts] - 10https://gerrit.wikimedia.org/r/978496 [11:26:06] (03PS2) 10Clément Goubert: mw-on-k8s: lower idle php-fpm requirement for jobrunner [alerts] - 10https://gerrit.wikimedia.org/r/978496 [11:26:25] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [11:26:55] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [11:27:04] (03PS2) 10Jbond: P:cache::varnish::frontend: remove parse_abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/817299 [11:27:20] (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s: lower idle php-fpm requirement for jobrunner [alerts] - 10https://gerrit.wikimedia.org/r/978496 (owner: 10Clément Goubert) [11:29:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 40.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:29:15] (03PS3) 10Jbond: P:cache::varnish::frontend: remove parse_abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/817299 [11:31:06] (03Abandoned) 10Jbond: 
P:cache::varnish::frontend: remove parse_abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/817299 (owner: 10Jbond) [11:31:29] (03Abandoned) 10Jbond: gerrit: add mock secrets [labs/private] - 10https://gerrit.wikimedia.org/r/832264 (owner: 10Jbond) [11:34:28] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: lower idle php-fpm requirement for jobrunner [alerts] - 10https://gerrit.wikimedia.org/r/978496 (owner: 10Clément Goubert) [11:35:42] (03Merged) 10jenkins-bot: mw-on-k8s: lower idle php-fpm requirement for jobrunner [alerts] - 10https://gerrit.wikimedia.org/r/978496 (owner: 10Clément Goubert) [11:35:54] (03Abandoned) 10Jbond: test_syncer: genralise temp data context manager [software/conftool] - 10https://gerrit.wikimedia.org/r/836883 (owner: 10Jbond) [11:37:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-jobrunner (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:37:13] (03CR) 10Jbond: [V: 03+1] "This would be a nice one to refresh at some point. 
the biggest issue with this is the cross validation but it could ultimately mean that " [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [11:38:39] (03CR) 10Jbond: [V: 03+1] P:base::production: update hiera preference public vs private (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [11:39:28] (03Abandoned) 10Jbond: C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [11:39:40] (03Abandoned) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [11:40:24] (03CR) 10Volans: netbox: update netbox service definition so it pages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:41:17] (03CR) 10Jbond: "i don't think this is supper usefull but wanted to get your eyes before i abbandon in case it solves a problems in dcl" [puppet] - 10https://gerrit.wikimedia.org/r/841479 (owner: 10Jbond) [11:42:06] (03PS1) 10Clément Goubert: mw-on-k8s: Exclude jobrunner from latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/978497 [11:43:38] (03CR) 10Jbond: "@Emperor i created this when debugging swift disks. 
do you thin its useful to have if so i can clean it up" [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [11:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 43.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:44:44] (03CR) 10Hnowlan: [C: 04-1] Update cxserver to 2023-11-28-064518-production (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [11:45:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 48.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:45:17] (03Abandoned) 10Jbond: motd::script: update define to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [11:45:35] (03PS1) 10Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978171 (https://phabricator.wikimedia.org/T343123) [11:45:51] (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s: Exclude jobrunner from latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/978497 (owner: 10Clément Goubert) [11:46:04] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Exclude jobrunner from latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/978497 (owner: 10Clément Goubert) [11:46:51] (03PS2) 10Jbond: R:system::role: colour system role based 
on its name [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) [11:47:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-jobrunner (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:47:18] (03Merged) 10jenkins-bot: mw-on-k8s: Exclude jobrunner from latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/978497 (owner: 10Clément Goubert) [11:47:22] 10SRE, 10SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10eoghan) [11:47:22] (03CR) 10Jbond: "let me know what you think" [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [11:47:35] (03Abandoned) 10Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [labs/private] - 10https://gerrit.wikimedia.org/r/852994 (owner: 10Jbond) [11:47:55] (03Abandoned) 10Jbond: P:contact: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/852982 (owner: 10Jbond) [11:48:06] (03Abandoned) 10Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [puppet] - 10https://gerrit.wikimedia.org/r/852995 (owner: 10Jbond) [11:48:14] (03PS2) 10Jbond: C:raid::mdadm: remove daily cron job [puppet] - 10https://gerrit.wikimedia.org/r/853307 (https://phabricator.wikimedia.org/T169564) [11:48:42] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/853307 (https://phabricator.wikimedia.org/T169564) (owner: 10Jbond) [11:49:10] (03Abandoned) 10Jbond: testing: add files useful for testing locally [software/cas-overlay-template] (testing) - 10https://gerrit.wikimedia.org/r/856563 (owner: 10Jbond) [11:49:24] (PHPFPMTooBusy) 
resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:51:16] (03Abandoned) 10Jbond: cergen: add Icinga check to validate the expiry date on certificates [puppet] - 10https://gerrit.wikimedia.org/r/552260 (https://phabricator.wikimedia.org/T238833) (owner: 10Jbond) [11:52:54] (03CR) 10Slyngshede: [V: 03+1] C:netbox switch Netbox-Next to use plain OIDC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [11:53:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 48.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:53:11] 10SRE, 10SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10eoghan) Hi @odimitrijevic and @Milimetric! Can one of you approve this request for the `analytics-privatedata-users` group access, please? 
[11:57:11] (03PS1) 10Hnowlan: mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/978500 (https://phabricator.wikimedia.org/T349796) [12:00:57] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:01:26] (03CR) 10Jbond: [C: 03+2] gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [12:02:09] !log Disabled Puppet agent on gerrit1003 and gerrit2002 to roll https://gerrit.wikimedia.org/r/844998 which requires some manual steps | T317412 [12:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:22] T317412: Automate Gerrit deployment steps - https://phabricator.wikimedia.org/T317412 [12:02:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:03:04] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978171 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [12:03:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:03:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.327 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:00] (03Merged) 10jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/978171 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [12:04:31] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[1075-1090].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [12:04:54] (03PS6) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 [12:04:56] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [12:05:05] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:05:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-jobrunner (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:05:22] (03Abandoned) 10Jbond: CI - puppet-lint: Add puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862855 (https://phabricator.wikimedia.org/T127797) (owner: 10Jbond) [12:05:30] (03Abandoned) 10Jbond: do not merge: test puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862856 (https://phabricator.wikimedia.org/T127797) (owner: 10Jbond) [12:05:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[1075-1090].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [12:05:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts 
cp[1075-1090].eqiad.wmnet [12:05:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `cp[1075-1090].eqiad.wmnet` - cp1075.eqiad.wmnet (**PASS**) - Downtimed hos... [12:05:50] (03Abandoned) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [12:05:59] hnowlan: Thanks! [12:06:59] (03CR) 10Btullis: [C: 03+2] C:statistics::compute: correct user param [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [12:08:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:09:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [12:09:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:10:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) 05Open→03Resolved All activities for this task have been completed, refer to the other linked tasks for more details on decommissioning old hosts [12:10:51] 10SRE, 10ops-eqiad, 
10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Fabfur) [12:10:55] (03PS1) 10Clément Goubert: mw-on-k8s: fix mw-jobrunner rules [alerts] - 10https://gerrit.wikimedia.org/r/978502 [12:11:37] PROBLEM - Check systemd state on ncredir4001 is CRITICAL: CRITICAL - degraded: The following units failed: tcp-mss-clamper.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:39] (03PS1) 10Vgutierrez: Revert "Revert "Revert "service: Disable IPIP encapsulation for ncredir@ulsfo""" [puppet] - 10https://gerrit.wikimedia.org/r/978085 [12:12:57] ^^ ncredir4001 is me and expected [12:13:10] (03PS3) 10Jbond: cfssl::cert: add ability to renew based on a relative value [puppet] - 10https://gerrit.wikimedia.org/r/866602 [12:13:20] ack [12:13:24] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:13:27] (03CR) 10Vgutierrez: [C: 03+2] Revert "Revert "Revert "service: Disable IPIP encapsulation for ncredir@ulsfo""" [puppet] - 10https://gerrit.wikimedia.org/r/978085 (owner: 10Vgutierrez) [12:13:37] (03CR) 10Jbond: "something i was working on let me know what you think" [puppet] - 10https://gerrit.wikimedia.org/r/866602 (owner: 10Jbond) [12:13:45] (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s: fix mw-jobrunner rules [alerts] - 10https://gerrit.wikimedia.org/r/978502 (owner: 10Clément Goubert) [12:14:02] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: fix mw-jobrunner rules [alerts] - 10https://gerrit.wikimedia.org/r/978502 (owner: 10Clément Goubert) [12:15:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-jobrunner (k8s) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:15:16] (03Merged) 10jenkins-bot: mw-on-k8s: fix mw-jobrunner rules [alerts] - 10https://gerrit.wikimedia.org/r/978502 (owner: 10Clément Goubert) [12:17:13] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:17:23] RECOVERY - Check systemd state on ncredir4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:46] (03PS3) 10Jbond: cumin::master: WIP/PoC make profile a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/867602 [12:19:15] (03CR) 10CI reject: [V: 04-1] cumin::master: WIP/PoC make profile a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/867602 (owner: 10Jbond) [12:21:08] (03CR) 10Muehlenhoff: C:raid::mdadm: remove daily cron job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853307 (https://phabricator.wikimedia.org/T169564) (owner: 10Jbond) [12:22:27] !log hashar@deploy2002 Started deploy [gerrit/gerrit@a087269]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 [12:22:32] T317412: Automate Gerrit deployment steps - https://phabricator.wikimedia.org/T317412 [12:22:42] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@a087269]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 (duration: 00m 15s) [12:24:17] (03Abandoned) 10Jbond: cumin::master: WIP/PoC make profile a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/867602 (owner: 10Jbond) [12:24:29] (03PS1) 10Btullis: Deploy kube-state-metrics to the dse-k8s 
cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) [12:24:37] 10SRE, 10Traffic, 10Patch-For-Review: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers - https://phabricator.wikimedia.org/T352160 (10Vgutierrez) 05Open→03Resolved [12:24:41] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [12:24:53] (03Abandoned) 10Jbond: O:cluster::cloud_managment: remove unneeded profiles [puppet] - 10https://gerrit.wikimedia.org/r/868451 (owner: 10Jbond) [12:24:56] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [12:24:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:25:04] (03CR) 10Cathal Mooney: [C: 03+2] Reset spine switch BGP to CR if max prefix tripped after 30 mins [homer/public] - 10https://gerrit.wikimedia.org/r/975799 (https://phabricator.wikimedia.org/T349116) (owner: 10Cathal Mooney) [12:25:06] (03Abandoned) 10Jbond: ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 (owner: 10Jbond) [12:25:08] (03CR) 10Muehlenhoff: "Looks good, but we also need a default for cloud, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [12:25:22] (03Abandoned) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 (owner: 10Jbond) [12:25:31] (03PS1) 10Hashar: scap: change deploy user from gerrit2 to gerrit-deploy [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/978505 (https://phabricator.wikimedia.org/T317412) [12:25:38] !log rolling restart of pybal on lvs4008 and lvs4010, effectively enabling IPIP encapsulation for ncredir@ulsfo - T351069 [12:25:40] (03Merged) 10jenkins-bot: Reset spine switch BGP to CR if max prefix tripped after 30 mins [homer/public] - 10https://gerrit.wikimedia.org/r/975799 (https://phabricator.wikimedia.org/T349116) (owner: 10Cathal Mooney) [12:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:55] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [12:26:02] (03CR) 10Hashar: [C: 03+2] scap: change deploy user from gerrit2 to gerrit-deploy [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/978505 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [12:26:18] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [12:26:34] (03Merged) 10jenkins-bot: scap: change deploy user from gerrit2 to gerrit-deploy [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/978505 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [12:27:07] (03Abandoned) 10Jbond: django-sso: improve debug page [puppet] - 10https://gerrit.wikimedia.org/r/869857 (owner: 10Jbond) [12:28:11] (03PS3) 10Jbond: cumin::target: use concat to manage the file [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) [12:29:04] (03CR) 10Jbond: "let me know if this is useful. 
i think the idea was to support multiple keys" [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [12:29:17] (03Abandoned) 10Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond) [12:29:29] RECOVERY - Check systemd state on mw1474 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:51] (03PS1) 10Hashar: Add .gitreview file [software/gerrit/tools/gervert/deploy] - 10https://gerrit.wikimedia.org/r/978526 [12:29:53] (03PS1) 10Hashar: Change deploy user from gerrit2 to gerrit-deploy [software/gerrit/tools/gervert/deploy] - 10https://gerrit.wikimedia.org/r/978527 (https://phabricator.wikimedia.org/T317412) [12:30:50] (03CR) 10CI reject: [V: 04-1] cumin::target: use concat to manage the file [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [12:33:13] (03PS3) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) [12:34:02] (03CR) 10Jbond: "please provide some input in if this cr is desirable or if it should be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [12:34:31] (03PS1) 10Hashar: Use gerrit dsh group from deployment server [software/gerrit/tools/gervert/deploy] - 10https://gerrit.wikimedia.org/r/978528 [12:34:48] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [12:35:02] !log hashar@deploy2002 Started deploy [gervert/deploy@ca6bba0]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 [12:35:08] T317412: Automate Gerrit 
deployment steps - https://phabricator.wikimedia.org/T317412 [12:35:14] !log hashar@deploy2002 Finished deploy [gervert/deploy@ca6bba0]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 (duration: 00m 12s) [12:35:25] (03CR) 10Jbond: "i think it would still be useful to do this clean up but also happy to abbandon" [puppet] - 10https://gerrit.wikimedia.org/r/884040 (owner: 10Jbond) [12:35:53] (03CR) 10Jbond: "this really depends on the other CR in the chain" [puppet] - 10https://gerrit.wikimedia.org/r/888198 (owner: 10Jbond) [12:35:55] !log hashar@deploy2002 Started deploy [gerrit/gerrit@6b23c27]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 [12:36:01] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@6b23c27]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 (duration: 00m 06s) [12:36:16] (03CR) 10Jbond: [V: 03+1] "looks like i allready have a cr for this" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [12:36:38] (03Abandoned) 10Jbond: sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [12:36:53] (03Abandoned) 10Jbond: posix_acl: add module to manage posix file system ACLs [puppet] - 10https://gerrit.wikimedia.org/r/889563 (https://phabricator.wikimedia.org/T113979) (owner: 10Jbond) [12:38:30] (03CR) 10Jbond: "is there intrest in progressing with this? if so i can address the comments" [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond) [12:38:41] (03PS1) 10Hnowlan: geo-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/978529 [12:39:18] (03CR) 10Jbond: "@Emperor is this still useful?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [12:39:50] (03CR) 10Jbond: "is this still useful? looks complete" [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [12:40:06] I am applying a puppet change to Gerrit / gerrit1003 to slightly change how it is deployed but that should not affect the service [12:40:50] (03CR) 10Jbond: "are you interested in this? if so ill clean it up" [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [12:40:55] (03CR) 10MVernon: "Yes, definitely! I'm expecting to want to use it on some moss backends quite soon." [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [12:43:54] !log hashar@deploy2002 Started deploy [gerrit/gerrit@6b23c27]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 [12:43:59] T317412: Automate Gerrit deployment steps - https://phabricator.wikimedia.org/T317412 [12:44:01] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@6b23c27]: Verify scap deployment after changing the scap user from gerrit2 to gerrit-deploy - T317412 (duration: 00m 07s) [12:45:35] (03CR) 10Jbond: "@volans can you take an extra pass on this, should be good to merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [12:45:42] (03CR) 10MVernon: "I've often done equivalent by eyeballing `df -lh` output when rebooting swift backends, so a "proper" tool will continue to be useful whil" [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [12:46:21] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Switch ulsfo to PKI [puppet] - 10https://gerrit.wikimedia.org/r/978491 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [12:46:52] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:47:02] 
10SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (10Mh-3110) [12:49:48] 10SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (10Ladsgroup) a:03Ladsgroup Hi, 1 - Is this for https://meta.wikimedia.org/wiki/Wikim%C3%A9diens_du_B%C3%A9nin_User_Group or a different UG? 2- Just to double check, you need a... [12:51:25] PROBLEM - HTTPS Ganeti RAPI ulsfo on ganeti4005 is CRITICAL: connect to address ganeti01.svc.ulsfo.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [12:51:48] ^ RAPI is expected, WIP [12:52:53] RECOVERY - HTTPS Ganeti RAPI ulsfo on ganeti4005 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.016 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [12:52:55] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:52:58] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:53:59] RECOVERY - Check whether ferm is active by checking the default input chain on mw1474 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:55:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [12:55:31] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_ulsfo_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:52] (03CR) 10MVernon: P:swift::storage: Use disk_type to identify swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [12:57:51] PROBLEM - 
Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:58:03] !log restoring DB snapshot from 11:37 UTC to netboxdb1002 [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow4002.ulsfo.wmnet [13:01:01] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:01:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:01:17] (03PS4) 10Jbond: admin: update admin modules so users without sshkeys get nologin shell [puppet] - 10https://gerrit.wikimedia.org/r/666367 [13:01:25] (03CR) 10MVernon: POC: P:thanos::swift::frontend: move ring manager config to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773794 (owner: 10Jbond) [13:03:44] (03PS5) 10Jbond: admin: update admin modules so users without sshkeys get nologin shell [puppet] - 10https://gerrit.wikimedia.org/r/666367 [13:05:00] 10SRE, 10Infrastructure-Foundations: Restore Netbox DB from before lsw1-e1-eqiad was removed - https://phabricator.wikimedia.org/T352286 (10cmooney) p:05Triage→03High [13:05:18] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002 [13:05:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002 [13:05:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow4002.ulsfo.wmnet [13:05:38] 10SRE, 10Infrastructure-Foundations: Restore Netbox DB from before lsw1-e1-eqiad was removed - https://phabricator.wikimedia.org/T352286 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ae2a2be7-cae0-4e62-92ca-558767ce7e8f) set by 
cmooney@cumin1001 for 0:20:00 on 1 host(s) and their services... [13:05:41] (03CR) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:07:07] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:24] (03PS6) 10Jbond: admin: update admin modules so users without sshkeys get nologin shell [puppet] - 10https://gerrit.wikimedia.org/r/666367 [13:08:17] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:30] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:09:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:09:39] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:10:02] (03PS5) 10Hashar: fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) [13:10:04] (03PS7) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) [13:10:06] (03PS8) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [13:10:59] (03CR) 10CI reject: [V: 04-1] admin: update admin modules so users without sshkeys get nologin shell [puppet] - 10https://gerrit.wikimedia.org/r/666367 (owner: 10Jbond) [13:11:18] (03PS1) 10Esanders: Enable DT visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978531 (https://phabricator.wikimedia.org/T352232) [13:12:15] (03CR) 10Hashar: scap3: stop 
defaulting deployment_group to 'wikidev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:14:40] (03PS7) 10Jbond: admin: update admin modules so users without sshkeys get nologin shell [puppet] - 10https://gerrit.wikimedia.org/r/666367 [13:14:57] (JobUnavailable) firing: (6) Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:16:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/762/con" [puppet] - 10https://gerrit.wikimedia.org/r/666367 (owner: 10Jbond) [13:18:13] (03CR) 10Jbond: [V: 03+1] "This should be ready for review now. the pcc diff is big as all the absented users now get a shell of nologin" [puppet] - 10https://gerrit.wikimedia.org/r/666367 (owner: 10Jbond) [13:20:03] (03CR) 10Clément Goubert: [C: 03+1] fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:21:35] (03CR) 10Clément Goubert: [C: 03+1] scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:22:10] (03CR) 10Jbond: [V: 03+1] P:base::production: move system::role to profile::base::production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [13:23:14] (03CR) 10Slyngshede: [C: 03+2] Ensure that build directories are cleaned up [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/977672 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [13:23:38] (03CR) 10Clément Goubert: [C: 03+1] fix-staging-perms: set set-group-id on 
/srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:24:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'll take care of moving away from system::role in followups." [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [13:25:53] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service,netbox_ganeti_codfw_test_sync.service,netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service,netbox_ganeti_eqiad_sync.service,netbox_ganeti_eqsin_sync.service,netbox_ganeti_esams01_sync.service,netbox_ganeti_esams02_sync.service,netbox_ganeti_ulsfo_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:03] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:26:20] (03CR) 10Clément Goubert: [C: 03+2] fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:27:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:27:47] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:28:21] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:28:42] (03Merged) 10jenkins-bot: Ensure that build directories are cleaned
up [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/977672 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [13:28:57] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:28:59] PROBLEM - Check unit status of netbox_ganeti_esams02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:29:29] (03PS4) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) [13:29:33] PROBLEM - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:29:57] (JobUnavailable) firing: (6) Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:30:01] !log jbond@cumin1001 START - Cookbook sre.swift.audit-labels for host ms-be[2044-2073].codfw.wmnet,ms-be[1044-1075].eqiad.wmnet [13:30:02] !log jbond@cumin1001 END (FAIL) - Cookbook sre.swift.audit-labels (exit_code=99) for host ms-be[2044-2073].codfw.wmnet,ms-be[1044-1075].eqiad.wmnet [13:30:37] PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:33:42] (03PS5) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 
(https://phabricator.wikimedia.org/T308677) [13:33:48] !log jbond@cumin1001 START - Cookbook sre.swift.audit-labels for host ms-be[2044-2073].codfw.wmnet,ms-be[1044-1075].eqiad.wmnet [13:33:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.swift.audit-labels (exit_code=0) for host ms-be[2044-2073].codfw.wmnet,ms-be[1044-1075].eqiad.wmnet [13:34:31] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:34:51] (03PS6) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) [13:35:07] (03CR) 10Jbond: "This should be good to review now" [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [13:36:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base::production: move system::role to profile::base::production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [13:36:34] (03PS1) 10Majavah: hieradata: unconfigure wiki replica LVS services [puppet] - 10https://gerrit.wikimedia.org/r/978539 (https://phabricator.wikimedia.org/T346947) [13:36:49] (03PS1) 10Muehlenhoff: mediawiki::php: Set php-common version dependent on OS [puppet] - 10https://gerrit.wikimedia.org/r/978540 [13:38:13] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:38:47] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:39:25] RECOVERY - Check unit status of netbox_ganeti_esams02_sync on netbox1002 is OK: OK: Status of the systemd unit 
netbox_ganeti_esams02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:39:59] RECOVERY - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:40:32] (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: reduce scope of the Lift Wing's pilot [puppet] - 10https://gerrit.wikimedia.org/r/977738 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:41:03] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:41:10] (03PS1) 10Hashar: fix-staging-perms: chgrp symbolic link, not its target! [puppet] - 10https://gerrit.wikimedia.org/r/978541 (https://phabricator.wikimedia.org/T338205) [13:41:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978540 (owner: 10Muehlenhoff) [13:42:45] !log installing tiff security updates [13:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:17] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:22] (03CR) 10Clément Goubert: [C: 03+2] fix-staging-perms: chgrp symbolic link, not its target! 
[puppet] - 10https://gerrit.wikimedia.org/r/978541 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:43:29] (03CR) 10Jbond: [V: 03+1] P:swift::storage: Use disk_type to identify swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [13:44:57] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:46:55] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:47:24] (03CR) 10Ayounsi: diffscan: pyhotnify (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond) [13:49:01] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:49:32] (03CR) 10Ayounsi: "not sure it's really needed." [puppet] - 10https://gerrit.wikimedia.org/r/849497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [13:49:47] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:50:34] (03CR) 10Clément Goubert: [C: 03+2] scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:51:27] (03CR) 10Ayounsi: [C: 04-1] "I'd say no, as netbox going down is not user facing." 
[puppet] - 10https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:51:48] (03CR) 10Volans: cumin::target: use concat to manage the file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [13:52:17] (03PS4) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 [13:53:28] (03PS5) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 [13:53:53] (03PS6) 10Jbond: POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 [13:55:42] (03PS3) 10Anzx: hewikibooks: update wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978081 (https://phabricator.wikimedia.org/T351913) [13:55:46] (03PS3) 10Anzx: hewikivoyage: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978082 (https://phabricator.wikimedia.org/T351981) [13:56:32] (03CR) 10CI reject: [V: 04-1] POC: P:thanos::swift::frontend: move ring manager config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/773794 (owner: 10Jbond) [13:57:02] (03CR) 10Clément Goubert: [C: 03+2] fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [13:59:08] (03CR) 10Volans: "Don't know, would need some time to look at it, don't worry leave it as is in case we'll resume from here." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [13:59:30] (03CR) 10CI reject: [V: 04-1] sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [13:59:37] (03CR) 10MVernon: P:swift::storage: Use disk_type to identify swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [13:59:46] (03PS9) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1400). [14:00:05] stephanebisson and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:25] o/ [14:00:32] (03CR) 10Esanders: "Should be deployed on 2023-12-06" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978531 (https://phabricator.wikimedia.org/T352232) (owner: 10Esanders) [14:00:33] Hi [14:03:23] (03CR) 10Clément Goubert: [C: 03+2] fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:05:29] (03CR) 10Volans: [C: 03+1] "LGTM, but I'd get first a feedback from DC-Ops to make sure they are ok with this." [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [14:06:08] Is anyone available to run the deployment window? 
[14:06:53] (03PS2) 10Sbisson: Configure wiki-highlights experiment stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978096 (https://phabricator.wikimedia.org/T348613) [14:09:26] (03PS1) 10FNegri: [openstack] Upgrade cloudservices to Antelope [puppet] - 10https://gerrit.wikimedia.org/r/978545 (https://phabricator.wikimedia.org/T348843) [14:09:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [14:10:03] I can deploy [14:10:47] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host schema1003.eqiad.wmnet with OS bookworm [14:10:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978096 (https://phabricator.wikimedia.org/T348613) (owner: 10Sbisson) [14:11:50] (03Merged) 10jenkins-bot: Configure wiki-highlights experiment stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978096 (https://phabricator.wikimedia.org/T348613) (owner: 10Sbisson) [14:11:52] 10SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (10Mh-3110) Hi @Ladsgroup, 1- Yes, it is that UG 2- Also, yes. 
It is for a public one Many thanks [14:12:00] (03CR) 10Filippo Giunchedi: [C: 03+1] P:wmcs: disable systemd icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/978476 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [14:12:08] (03PS1) 10Muehlenhoff: ganeti: Switch eqsin to PKI [puppet] - 10https://gerrit.wikimedia.org/r/978606 (https://phabricator.wikimedia.org/T350686) [14:12:14] (03CR) 10Filippo Giunchedi: [C: 03+1] RAID - Add instance name to MD RAID alert summary [alerts] - 10https://gerrit.wikimedia.org/r/978485 (owner: 10Slyngshede) [14:12:36] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:978096|Configure wiki-highlights experiment stream (T348613)]] [14:12:37] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/764/con" [puppet] - 10https://gerrit.wikimedia.org/r/978545 (https://phabricator.wikimedia.org/T348843) (owner: 10FNegri) [14:12:49] T348613: Implement Wiki-highlights microsite instrumentation - https://phabricator.wikimedia.org/T348613 [14:14:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [14:14:34] (03CR) 10Filippo Giunchedi: [C: 03+1] arclamp: redirect alerts to o11y [alerts] - 10https://gerrit.wikimedia.org/r/978063 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [14:15:46] !log reload thanos-rule on titan[12]001 to pick up new pyrra rec rules [14:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:53] (03CR) 10Volans: [C: 04-1] "I think there are some minor issues" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [14:16:54] scap is taking a while building k8s images… [14:17:05] Ah, that's why [14:17:23] Is it deploying everywhere or to a single test server? 
[14:17:52] (03PS1) 10Vgutierrez: prometheus::ops: Fix lvs_realserver_clamper config [puppet] - 10https://gerrit.wikimedia.org/r/978608 (https://phabricator.wikimedia.org/T351069) [14:17:53] just to the test servers so far [14:17:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2001.codfw.wmnet with OS bullseye [14:18:07] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logging-hd2001.codfw.wmnet with OS bullseye [14:18:30] if I’m reading the log correctly, it’s currently waiting while trying to push the finished image to the registry o_O [14:20:59] it's pushing to the registry [14:21:17] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/978608 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:21:20] ok [14:21:23] 5.5MB/s is not much though [14:21:49] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on schema1003.eqiad.wmnet with reason: host reimage [14:24:09] is the progress visible somewhere? /home/lucaswerkmeister-wmde/scap-image-build-and-push-log apparently hasn’t been written since 14:16 UTC [14:24:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on schema1003.eqiad.wmnet with reason: host reimage [14:24:45] ah, it finished [14:25:11] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Switch eqsin to PKI [puppet] - 10https://gerrit.wikimedia.org/r/978606 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:25:20] (building the image, that is. 
now running docker pull on k8s nodes) [14:25:46] (03PS1) 10Vgutierrez: hiera: Set cluster to ncredir on ncredir instances [puppet] - 10https://gerrit.wikimedia.org/r/978609 [14:26:16] Lucas_WMDE: Progress isn't unfortunately [14:26:29] ok thanks [14:26:31] I was watching network usage through grafana [14:26:36] ah ^^ [14:27:09] the docker_pull_k8s also seems unusually slow… is it just a big image today for some reason? [14:27:18] * Lucas_WMDE checks what happened to that geoip ticket [14:28:01] doesn’t sound like that file is included now 🤷 [14:28:41] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:52] Lucas_WMDE: never was [14:29:01] It's always been mounted through the host [14:29:03] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ops: Fix lvs_realserver_clamper config [puppet] - 10https://gerrit.wikimedia.org/r/978608 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:29:18] Well, always maybe not, but it's been that way for a while [14:30:07] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:58] ok [14:31:39] sync-masters took 2m12s btw [14:32:55] (03PS2) 10Vgutierrez: hiera: Set cluster to ncredir on ncredir instances [puppet] - 10https://gerrit.wikimedia.org/r/978609 [14:33:05] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Fix lvs_realserver_clamper config [puppet] - 10https://gerrit.wikimedia.org/r/978608 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:34:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2002.codfw.wmnet with OS bullseye [14:34:59] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - 
https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logging-hd2002.codfw.wmnet with OS bullseye [14:35:55] (03CR) 10Volans: [C: 03+1] "Seems sane to me but I didn't test it. Feel free to merge and test it." [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:36:02] aanzx: if you rebase your hewikivoyage change onto the hewikibooks one, I think they can both be deployed together [14:36:14] (though idk if we’ll even have time for one more change after the current one…) [14:36:32] ok [14:36:34] (03CR) 10Majavah: [C: 03+1] [openstack] Upgrade cloudservices to Antelope [puppet] - 10https://gerrit.wikimedia.org/r/978545 (https://phabricator.wikimedia.org/T348843) (owner: 10FNegri) [14:36:49] !log lucaswerkmeister-wmde@deploy2002 sbisson and lucaswerkmeister-wmde: Backport for [[gerrit:978096|Configure wiki-highlights experiment stream (T348613)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:54] stephanebisson: please test [14:36:57] T348613: Implement Wiki-highlights microsite instrumentation - https://phabricator.wikimedia.org/T348613 [14:37:03] sync-testservers took 6m14s… [14:37:14] Lucas_WMDE against which server? [14:37:25] anything that the WikimediaDebug extension offers :) [14:37:33] `scap backport` syncs to all of them [14:37:43] e.g. mwdebug2002, or k8s-experimental [14:37:46] wonderful [14:38:00] Lucas_WMDE working as expected [14:38:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host schema1003.eqiad.wmnet with OS bookworm [14:38:08] ok! 
[14:38:12] !log lucaswerkmeister-wmde@deploy2002 sbisson and lucaswerkmeister-wmde: Continuing with sync [14:38:19] let’s see how long this takes now… [14:38:33] I guess the k8s should be relatively fast, because (IIUC) the network traffic already happened during the docker pull earlier [14:38:43] syncing to the bare metal servers might take a while though [14:38:51] yeah, it'll just be the rolling deploy for k8s [14:38:56] (03CR) 10FNegri: "I'm pretty sure we solved the "multiple keys" problem, but I don't remember how 😄 -- I can dig further but I don't think we have a use cas" [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [14:39:00] (JobUnavailable) firing: (6) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:08] which admittedly takes a while because we have a bunch of replicas now [14:39:18] !log cp4052 - depool and disable puppet agent, more pipe debug [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:03] (03PS1) 10Muehlenhoff: ganeti: Really enable PKI for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/978610 (https://phabricator.wikimedia.org/T350686) [14:40:14] the docker_pull_k8s ran on 106 somethings (ok: 106) – that’s the nodes, each of which can run several pods, right? [14:40:27] so we have a bit over 100 servers available to k8s at the moment? [14:40:43] (03CR) 10CI reject: [V: 04-1] ganeti: Really enable PKI for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/978610 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:41:44] hm, are there only four non-k8s canaries left? 
(IIRC it used to be about 10) [14:42:04] (03PS1) 10Majavah: hieradata: set a default for role_description in cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/978611 [14:42:50] (03PS4) 10Anzx: hewikibooks: update wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978081 (https://phabricator.wikimedia.org/T351913) [14:43:31] (03CR) 10FNegri: [V: 03+1 C: 03+2] [openstack] Upgrade cloudservices to Antelope [puppet] - 10https://gerrit.wikimedia.org/r/978545 (https://phabricator.wikimedia.org/T348843) (owner: 10FNegri) [14:43:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host schema1004.eqiad.wmnet with OS bookworm [14:45:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2001.codfw.wmnet with reason: host reimage [14:45:49] 10SRE, 10Infrastructure-Foundations: Restore Netbox DB from before lsw1-e1-eqiad was removed - https://phabricator.wikimedia.org/T352286 (10cmooney) [14:46:10] (03CR) 10Herron: [C: 03+2] arclamp: redirect alerts to o11y [alerts] - 10https://gerrit.wikimedia.org/r/978063 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [14:46:47] 10SRE, 10Infrastructure-Foundations: Restore Netbox DB from before lsw1-e1-eqiad was removed - https://phabricator.wikimedia.org/T352286 (10cmooney) 05Open→03Resolved a:03cmooney Netbox DB has been restored to state as of 11:29 UTC. Netbox services restarted and all seems ok. 
[14:47:23] (03Merged) 10jenkins-bot: arclamp: redirect alerts to o11y [alerts] - 10https://gerrit.wikimedia.org/r/978063 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [14:48:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2001.codfw.wmnet with reason: host reimage [14:48:30] 10SRE, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Papaul) [14:48:38] jouncebot: next [14:48:38] In 0 hour(s) and 11 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1500) [14:48:41] hm [14:48:49] (03CR) 10Majavah: [C: 03+2] hieradata: set a default for role_description in cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/978611 (owner: 10Majavah) [14:48:58] I think there might not be time for the logo / wordmark / tagline changes then, sorry aanzx :( [14:49:07] (scap is currently in sync-apaches, 71%) [14:49:09] ok, np [14:49:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] hewikibooks: update wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978081 (https://phabricator.wikimedia.org/T351913) (owner: 10Anzx) [14:49:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] hewikivoyage: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978082 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [14:49:55] (03PS2) 10Muehlenhoff: ganeti: Really enable PKI for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/978610 (https://phabricator.wikimedia.org/T350686) [14:50:44] (03CR) 10JMeybohm: [C: 03+1] istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [14:51:18] (03CR) 10JMeybohm: [C: 03+1] cert-manager: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977220 (https://phabricator.wikimedia.org/T351933) 
(owner: 10Elukey) [14:52:29] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Really enable PKI for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/978610 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:53:51] I think the php-fpm-restart is going to finish just in time before the window ends [14:54:00] (JobUnavailable) firing: (9) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:26] does anyone know why this deployment was so slow? is it worth filing a phab task for it? [14:54:57] (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:34] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:978096|Configure wiki-highlights experiment stream (T348613)]] (duration: 42m 58s) [14:55:34] 10SRE, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10JMeybohm) 05Open→03Resolved a:03JMeybohm This LGTM now ` /dev/md0: Version : 1.2 Creation Time : Thu Sep 21 12:32:55 2023 Raid Level : raid1 Array Size : 937267200 (... [14:55:42] T348613: Implement Wiki-highlights microsite instrumentation - https://phabricator.wikimedia.org/T348613 [14:55:47] !log UTC afternoon backport+config window done [14:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:14] Thanks Lucas_WMDE! 
Sorry it was so long [14:56:16] given that it seemingly affected both the k8s image and the regular sync, it feels like either the deployed data is much bigger than usual or internal traffic is somehow slower [14:56:27] (but not, say, that the k8s registry was laggy) [14:56:58] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on schema1004.eqiad.wmnet with reason: host reimage [14:57:03] (03CR) 10CDanis: cfssl::cert: add ability to renew based on a relative value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866602 (owner: 10Jbond) [14:58:47] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:59:26] I don’t see anything suspicious in `ncdu -x /srv/mediawiki-staging/` [14:59:33] (in fact wmf.7 is somewhat smaller than wmf.5) [14:59:39] s/somewhat/slightly/ [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1500) [15:00:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow5002.eqsin.wmnet [15:01:46] Lucas_WMDE: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/caacab569ae338edf14ae39a97df89307e09e1da%5E%21/#F0 seems to have changed the owner of the mediawiki files, so maybe scap considered those as changed files? 
(cc hashar) [15:02:02] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2002.codfw.wmnet with reason: host reimage [15:02:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on schema1004.eqiad.wmnet with reason: host reimage [15:02:58] hm [15:03:09] taavi: that could be it I suppose [15:03:28] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1171.eqiad.wmnet with OS bullseye [15:03:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye [15:03:38] not sure how rsync handles permission changes [15:03:52] but it sounds plausible [15:03:53] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1161.eqiad.wmnet with OS bullseye [15:03:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1162.eqiad.wmnet with OS bullseye [15:03:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [15:04:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [15:04:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye [15:04:06] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye [15:04:58] (JobUnavailable) firing: (11) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:16] in that case it would only be a one-time effect (I assume) [15:05:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2002.codfw.wmnet with reason: host reimage [15:05:50] !log cp4052 - back to normal operations [15:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:23] (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [15:06:34] (03CR) 10Elukey: [V: 03+2 C: 03+2] cert-manager: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977220 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [15:06:39] (03PS1) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 [15:07:03] James_F: is there anything to deploy from Wikifunctions this window? [15:07:17] Nope, please use it if you need it. [15:07:19] ok [15:07:21] aanzx: still around? 
[15:07:28] then I could deploy those changes now [15:07:31] Yes [15:07:33] and see if scap is faster now [15:07:34] ok [15:07:35] let’s try it :) [15:07:49] (03PS4) 10Lucas Werkmeister (WMDE): hewikivoyage: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978082 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [15:07:57] (03PS5) 10Lucas Werkmeister (WMDE): hewikibooks: update wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978081 (https://phabricator.wikimedia.org/T351913) (owner: 10Anzx) [15:08:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow5002.eqsin.wmnet [15:08:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978082 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [15:08:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978081 (https://phabricator.wikimedia.org/T351913) (owner: 10Anzx) [15:09:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:09:21] Lucas_WMDE: taavi o/ [15:09:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2003.codfw.wmnet with OS bullseye [15:09:28] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logging-hd2003.codfw.wmnet with OS bullseye [15:09:49] what is the problem? 
[15:09:59] scap was being very slow [15:10:05] one deployment took up the whole backport+config window now [15:10:12] taavi speculated it was due to the permission changes [15:10:15] (03Merged) 10jenkins-bot: hewikivoyage: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978082 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [15:10:16] I’m doing another deployment now [15:10:19] (03Merged) 10jenkins-bot: hewikibooks: update wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978081 (https://phabricator.wikimedia.org/T351913) (owner: 10Anzx) [15:10:26] under the assumption that, if it was the permission changes, it should have been a one-time effect [15:10:26] yeah I changed some rights in /srv/patches and /srv/mediawiki-staging [15:10:32] cause some files were still owned by `wikidev` [15:10:50] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [15:10:52] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:978082|hewikivoyage: update wordmark (T351981)]], [[gerrit:978081|hewikibooks: update wordmark and tagline (T351913)]] [15:10:53] so I guess maybe that caused a full copy of the code to happen (due to `COPY . /`) [15:10:59] T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981 [15:10:59] T351913: Update Hebrew Wikibooks logo to new vector style - https://phabricator.wikimedia.org/T351913 [15:11:02] and Docker finding out those files are now "different" [15:11:12] which would indeed explain the slowness :\ [15:11:30] so potentially we ended up rebuilding the k8s images from scratch ?!
:-\ [15:11:34] yeah, I would sort of expect it for the images in that case [15:11:41] though it also seemed like the non-k8s sync was slower [15:11:52] and I would’ve expected rsync to be smarter about changed permissions with identical content? [15:12:03] but okay, building the k8s image was apparently very fast now, it’s already done [15:12:07] (32s) [15:12:17] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:978082|hewikivoyage: update wordmark (T351981)]], [[gerrit:978081|hewikibooks: update wordmark and tagline (T351913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:12:23] testing [15:12:24] I don't really know the details of scap / images building etc [15:12:45] but I have made `scap` add timing metadata to the events sent to syslog/logstash etc [15:13:22] Lucas_WMDE: looks good [15:13:26] ok! [15:13:27] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Continuing with sync [15:13:34] * Lucas_WMDE prepares the purgeList command [15:13:51] so far it feels like everything is being speedy again [15:15:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host schema1004.eqiad.wmnet with OS bookworm [15:15:40] so I don't know [15:15:47] but l10n update got rebuilt as far as I can tell [15:15:48] and Finished sync-testservers (duration: 06m 14s) [15:15:59] which is I guess rsync copying everything over [15:16:14] maybe that is the large cdb files [15:16:15] yeah, I think the only action item in the end will be “we can keep in mind that this happens when lots of permissions change” ^^ [15:16:21] and l10nupdate kicked in cause the files got "changed" [15:16:25] as in they got touched [15:16:32] i don't know really [15:16:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:17:00] sorry for the delay -:-\ [15:17:19] sync-apaches was very speedy now, 39s) [15:17:19] ^^ wdqs alerts are known [15:17:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host dns1004.wikimedia.org [15:17:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:17:51] hashar: it’s okay, thanks for brainstorming :) [15:18:10] (03PS55) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [15:18:44] (03PS1) 10Muehlenhoff: Switch dns1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978616 (https://phabricator.wikimedia.org/T349619) [15:19:35] (03PS56) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [15:19:53] (03CR) 10Btullis: [C: 03+1] "I think the users who will potentially be affected are: seddon, jkumalah, kzeta, rmurthy, vthamaini, mttp, ppel, amuigai, seve-kim, mattcl" [puppet] - 10https://gerrit.wikimedia.org/r/666367 (owner: 10Jbond) [15:20:03] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:978082|hewikivoyage: update wordmark (T351981)]], [[gerrit:978081|hewikibooks: update wordmark and tagline (T351913)]] (duration: 09m 10s) [15:20:10] T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981 [15:20:11] T351913: Update Hebrew Wikibooks logo to new vector style - https://phabricator.wikimedia.org/T351913 
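The hypothesis raised in the discussion above is that files with identical content were treated as "different" after their ownership or permissions changed. A minimal sketch of that idea follows, assuming a toy change detector that hashes file metadata together with file bytes. The function names are hypothetical, and this is not the actual checksum used by Docker, rsync, or scap; permission bits stand in for ownership because changing the owner would need root.

```python
import hashlib
import os
import stat
import tempfile


def content_digest(path):
    """Hash only the file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def content_and_mode_digest(path):
    """Hash the file's bytes plus its permission bits.

    Hypothetical stand-in for a metadata-aware checksum; not the
    algorithm of any real tool.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    digest = hashlib.sha256()
    digest.update(f"{mode:o}\0".encode())
    with open(path, "rb") as fh:
        digest.update(fh.read())
    return digest.hexdigest()


def compare_after_chmod():
    """Return (content_unchanged, metadata_digest_changed) for a temp file
    whose mode is switched from 0o644 to 0o600 without touching its bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.php")
        with open(path, "w") as fh:
            fh.write("<?php echo 'hello';\n")
        os.chmod(path, 0o644)
        before = (content_digest(path), content_and_mode_digest(path))
        os.chmod(path, 0o600)
        after = (content_digest(path), content_and_mode_digest(path))
    return before[0] == after[0], before[1] != after[1]


if __name__ == "__main__":
    print(compare_after_chmod())
```

On a POSIX system both returned values should be true: the bytes are unchanged, yet a detector that includes metadata sees a new file. That would be consistent with a slow first deployment followed by fast ones, since the changed metadata only needs to be picked up once.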
[15:20:20] Lucas_WMDE: thanks [15:20:23] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1161'] [15:20:27] !log lucaswerkmeister-wmde@mwmaint2002:~$ printf '%s\n' https://en.wikipedia.org/static/images/mobile/copyright/{wikibooks,wikivoyage}-{tagline,wordmark}-he.svg | mwscript purgeList enwiki # T351913, T351981 [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:37] (03CR) 10Muehlenhoff: [C: 03+2] Switch dns1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978616 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:20:52] (hewikivoyage doesn’t have a tagline but I assume purging the URL doesn’t hurt ^^) [15:21:06] aanzx: np [15:21:18] hashar, taavi: confirmed, this deployment was much faster. nothing more to do about that I think [15:21:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1162'] [15:21:24] * Lucas_WMDE done again [15:21:38] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:21:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:21:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2001.codfw.wmnet with OS bullseye [15:21:51] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logging-hd2001.codfw.wmnet with OS bullseye completed:... 
[15:21:59] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:22:17] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1172'] [15:22:21] !log cp3066 - depool temporarily, log pipe debugging, etc [15:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:25] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] [15:22:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1171'] [15:22:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:23:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1171'] [15:23:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978540 (owner: 10Muehlenhoff) [15:24:21] Lucas_WMDE: great! 
thank you for reporting the slowness issue [15:25:49] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:59] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dns1004.wikimedia.org [15:26:13] (03PS1) 10Bking: admin_ng: tell flink-operator to listen to rdf-streaming-updater ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/978617 (https://phabricator.wikimedia.org/T349095) [15:26:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1161'] [15:26:38] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:27:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host dns2004.wikimedia.org [15:27:34] (03PS2) 10Bking: admin_ng: tell flink-operator to listen to rdf-streaming-updater ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/978617 (https://phabricator.wikimedia.org/T349095) [15:27:50] (03CR) 10DCausse: [C: 03+1] admin_ng: tell flink-operator to listen to rdf-streaming-updater ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/978617 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:28:21] (03CR) 10Bking: [C: 03+2] admin_ng: tell flink-operator to listen to rdf-streaming-updater ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/978617 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:28:29] (03PS1) 10Muehlenhoff: Switch dns2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978618 (https://phabricator.wikimedia.org/T349619) [15:28:37] !log jclark@cumin1001 END (PASS) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1162'] [15:28:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1171'] [15:30:05] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:15] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:30:32] (03PS1) 10Filippo Giunchedi: ncredir: stop exporting 'vhost' label via mtail [puppet] - 10https://gerrit.wikimedia.org/r/978619 (https://phabricator.wikimedia.org/T351934) [15:30:42] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host schema2003.codfw.wmnet with OS bookworm [15:31:14] (03Merged) 10jenkins-bot: admin_ng: tell flink-operator to listen to rdf-streaming-updater ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/978617 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:31:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1172'] [15:32:34] (03PS7) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) [15:33:10] (03CR) 10Jbond: [V: 03+1] P:swift::storage: Use disk_type to identify swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [15:33:20] (03Abandoned) 10Jbond: P:swift::storage: Use disk_type to identify swift disks [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [15:33:53] !log cp3066 - all back to normal ops [15:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:34:34] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:34:43] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:34:48] (03CR) 10Muehlenhoff: [C: 03+2] Switch dns2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978618 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:34:50] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:35:00] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:35:09] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:35:17] (03Abandoned) 10Jbond: cumin::target: use concat to manage the file [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [15:35:53] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:36:05] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1161.eqiad.wmnet with reason: host reimage [15:36:40] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:36:54] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:37:16] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:37:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] admin: update admin modules so users without sshkeys get nologin shell [puppet] - 10https://gerrit.wikimedia.org/r/666367 (owner: 10Jbond) [15:38:14] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1162.eqiad.wmnet with reason: host reimage [15:38:37] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1171.eqiad.wmnet with reason: host reimage [15:39:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1161.eqiad.wmnet with reason: host reimage [15:39:37] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:39:57] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:40:50] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1172.eqiad.wmnet with reason: host reimage [15:42:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1162.eqiad.wmnet with reason: host reimage [15:42:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED [15:44:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED [15:44:28] (03CR) 10Vgutierrez: [C: 03+1] "PCC happy on https://puppet-compiler.wmflabs.org/output/978609/767/" [puppet] - 10https://gerrit.wikimedia.org/r/978609 (owner: 10Vgutierrez) [15:44:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1171.eqiad.wmnet with reason: host reimage [15:45:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED [15:45:52] 
(03CR) 10Ssingh: [C: 03+1] hiera: Set cluster to ncredir on ncredir instances [puppet] - 10https://gerrit.wikimedia.org/r/978609 (owner: 10Vgutierrez) [15:45:55] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2096'] [15:46:06] (03CR) 10Vgutierrez: [C: 03+2] hiera: Set cluster to ncredir on ncredir instances [puppet] - 10https://gerrit.wikimedia.org/r/978609 (owner: 10Vgutierrez) [15:46:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2096'] [15:47:46] (03CR) 10Kamila Součková: [C: 03+1] [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) (owner: 10CDanis) [15:47:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1172.eqiad.wmnet with reason: host reimage [15:48:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff) [15:48:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:48:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2002.codfw.wmnet with OS bullseye [15:49:01] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logging-hd2002.codfw.wmnet with OS bullseye completed:... 
[15:49:03] (03PS1) 10Jbond: Revert "admin: update admin modules so users without sshkeys get nologin shell" [puppet] - 10https://gerrit.wikimedia.org/r/978509 [15:49:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "admin: update admin modules so users without sshkeys get nologin shell" [puppet] - 10https://gerrit.wikimedia.org/r/978509 (owner: 10Jbond) [15:49:23] (03CR) 10Kamila Součková: [C: 03+1] Deploy kube-state-metrics to the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis) [15:49:34] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Papaul) [15:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:50:14] (03CR) 10Vgutierrez: [C: 03+1] ncredir: stop exporting 'vhost' label via mtail [puppet] - 10https://gerrit.wikimedia.org/r/978619 (https://phabricator.wikimedia.org/T351934) (owner: 10Filippo Giunchedi) [15:50:27] !log dancy@deploy2002 Installing scap version "4.64.0" for 570 hosts [15:51:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dns2004.wikimedia.org [15:52:24] !log dancy@deploy2002 Installing scap version "4.64.0" for 570 hosts [15:52:28] (03CR) 10Filippo Giunchedi: [C: 03+2] ncredir: stop exporting 'vhost' label via mtail [puppet] - 10https://gerrit.wikimedia.org/r/978619 (https://phabricator.wikimedia.org/T351934) (owner: 10Filippo Giunchedi) [15:52:44] (03PS1) 10Vgutierrez: ncredir: Enable IPIP encapsulation on codfw [puppet] - 10https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) [15:53:48] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:54:06] !log bking@deploy2002 
helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:54:15] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/768/con" [puppet] - 10https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:54:37] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (10VRiley-WMF) Confirmed: Service Request 180679043 was successfully submitted. [15:56:09] (03PS1) 10Vgutierrez: hiera: Enable IPIP on codfw text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069) [15:56:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host lvs4008.ulsfo.wmnet [15:56:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:56:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:56:59] (03PS5) 10MdsShakil: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) [15:58:29] (03PS1) 10Muehlenhoff: Switch lvs4008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978626 (https://phabricator.wikimedia.org/T349619) [15:58:32] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on schema2003.codfw.wmnet with reason: host 
reimage [15:58:50] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [15:59:23] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:00:38] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:00:46] (03CR) 10Muehlenhoff: [C: 03+2] Switch lvs4008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978626 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:00:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:01:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:01:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:02:57] (03CR) 10Majavah: [C: 03+2] 
P:wmcs: disable systemd icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/978476 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:03:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on schema2003.codfw.wmnet with reason: host reimage [16:03:22] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10elukey) @KartikMistry I think that the MinT python code should be able to pull the model binary from Swift... [16:03:42] (03PS57) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [16:04:28] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [16:04:29] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:04:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:04:41] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:04:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:04:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1161.eqiad.wmnet with OS bullseye [16:05:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye completed: - an-worker1161 (**WA... [16:05:21] (03CR) 10Fabfur: [C: 03+1] ncredir: Enable IPIP encapsulation on codfw [puppet] - 10https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:05:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:05:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1162.eqiad.wmnet with OS bullseye [16:05:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1162.eqiad.wmnet with OS bullseye completed: - an-worker1162 (**WA... 
[16:05:47] (03CR) 10Fabfur: [C: 03+1] hiera: Enable IPIP on codfw text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:05:48] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:05:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:05:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host lvs4008.ulsfo.wmnet [16:05:52] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:05:53] (03CR) 10Ssingh: [C: 03+1] hiera: Enable IPIP on codfw text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:06:32] (03CR) 10Ssingh: [C: 03+2] conftool: introduce schema and host file for dnsboxes [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:06:44] (03CR) 10Jbond: [C: 03+2] syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:06:55] (03PS13) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [16:07:02] (03CR) 10Jbond: [V: 03+2] syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 
(https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:07:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:07:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1172.eqiad.wmnet with OS bullseye [16:07:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1172.eqiad.wmnet with OS bullseye completed: - an-worker1172 (**PA... [16:07:38] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host lvs4009.ulsfo.wmnet [16:07:42] (03PS4) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [16:07:44] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:08:35] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [16:08:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:08:50] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [16:08:54] !log jclark@cumin1001 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1171.eqiad.wmnet with OS bullseye [16:08:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1171.eqiad.wmnet with OS bullseye completed: - an-worker1171 (**WA... [16:11:36] (03PS1) 10Vgutierrez: lvs: Use profile::base::enable_rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/978627 [16:12:03] (03PS1) 10Muehlenhoff: Switch lvs4009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978628 (https://phabricator.wikimedia.org/T349619) [16:12:52] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [16:12:58] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:13:09] (03CR) 10Muehlenhoff: [C: 03+2] Switch lvs4009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978628 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:13:19] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (2) The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:13:26] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/770/con" [puppet] - 10https://gerrit.wikimedia.org/r/978627 (owner: 10Vgutierrez) [16:13:45] !log reload all thanos-rule daemons on titan* to pick up new Pyrra Lift Wing rules [16:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:59] !log restart pyrra-filesystem on titan* [16:14:02] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:22] (03PS2) 10Vgutierrez: ncredir: Enable IPIP encapsulation on codfw [puppet] - 10https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) [16:15:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:15:50] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:16:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [16:16:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host schema2003.codfw.wmnet with OS bookworm [16:16:39] (03PS24) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [16:16:41] (03PS20) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [16:16:43] (03PS23) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 
(https://phabricator.wikimedia.org/T347565) [16:16:45] (03PS23) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [16:16:47] (03PS9) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [16:17:03] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/771/con" [puppet] - 10https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:17:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host lvs4009.ulsfo.wmnet [16:17:56] (03CR) 10Ssingh: [C: 03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:18:02] (03PS1) 10Hnowlan: jobqueue: move move jobs to k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/978630 (https://phabricator.wikimedia.org/T349796) [16:18:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host lvs4010.ulsfo.wmnet [16:18:35] (03CR) 10BBlack: [C: 03+1] "LGTM, much better way!" 
[puppet] - 10https://gerrit.wikimedia.org/r/978627 (owner: 10Vgutierrez) [16:19:00] (03PS1) 10Muehlenhoff: Switch lvs4010 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978631 (https://phabricator.wikimedia.org/T349619) [16:19:03] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] lvs: Use profile::base::enable_rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/978627 (owner: 10Vgutierrez) [16:19:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/772/console" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:20:19] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:20:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:20:41] (03CR) 10Jbond: [C: 03+2] sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:20:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] jobqueue: move move jobs to k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/978630 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:21:02] (03CR) 10Jbond: [C: 03+2] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:21:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is 
above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [16:21:31] (03CR) 10Jbond: [C: 03+2] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:22:16] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978600 [16:22:22] (03CR) 10Hnowlan: [C: 03+2] jobqueue: move move jobs to k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/978630 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:22:52] (03CR) 10Muehlenhoff: [C: 03+2] Switch lvs4010 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/978631 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:23:19] (03CR) 10Jbond: [C: 03+2] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [16:23:19] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (3) The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:23:45] (03Merged) 10jenkins-bot: jobqueue: move move jobs to k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/978630 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:24:54] (03CR) 10MVernon: [C: 03+1] "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:24:58] (PuppetFailure) firing: Puppet has failed on 
lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:28:24] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:28:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host lvs4010.ulsfo.wmnet [16:28:52] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:28:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:31:50] 10SRE, 10Infrastructure-Foundations, 10Observability-Logging, 10Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565 (10CDanis) Conclusion at end of meeting was that o11y would migrate the base profile to use the new cfssl support ~next week [16:32:06] (03CR) 10MVernon: "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:32:09] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:32:33] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:33:19] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:33:34] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:34:38] (03PS58) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [16:34:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:35:28] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10MatthewVernon) I think we've broadly agreed the "process"; do you want to put a wikitech page together with that (and the initial data set(s)) on? And suggest a name for the swift account and I'll get... [16:36:44] (03PS1) 10Elukey: profile::pyrra::filesystem: fix lift wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/978633 [16:36:53] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:37:20] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:38:19] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [16:38:32] (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: fix lift wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/978633 (owner: 10Elukey) [16:38:33] ^^ WDQS/WCQS alerts are expected [16:38:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [16:38:52] inflatador: ebernhardson is trying to silence those alerts [16:39:18] !log confctl --object-type dnsbox select 'name=' set/ip= [16:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:41:50] (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: fix lift wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/978633 (owner: 10Elukey) [16:41:52] (03PS1) 10Jforrester: Revert "wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978510 [16:41:57] (03CR) 10Jforrester: [C: 03+2] Revert "wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978510 (owner: 10Jforrester) [16:42:46] (03Merged) 10jenkins-bot: Revert "wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978510 (owner: 10Jforrester) [16:42:52] (03PS6) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) [16:42:56] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) (owner: 10Jforrester) [16:43:25] !log jforrester@deploy2002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [16:43:45] (03Merged) 10jenkins-bot: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) (owner: 10Jforrester) [16:44:07] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:44:26] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:45:11] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:45:39] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:46:47] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:46:59] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:48:10] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:48:23] (03PS2) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-17-200241 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976848 (https://phabricator.wikimedia.org/T297509) [16:49:22] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch orchestrator to 2023-11-17-200241 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976848 (https://phabricator.wikimedia.org/T297509) (owner: 10Jforrester) [16:49:58] (03PS4) 10Ssingh: P:dns::auth::update: add support for setting ferm rules via confd [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) [16:50:00] (03PS1) 10Bking: flink-zk: Activate codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/978634 (https://phabricator.wikimedia.org/T349095) [16:50:15] (03Merged) 10jenkins-bot: wikifunctions: Switch orchestrator to 2023-11-17-200241 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976848 (https://phabricator.wikimedia.org/T297509) (owner: 10Jforrester) [16:50:25] (03CR) 10Ssingh: 
"rebased, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:51:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:42] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:52:23] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:53:01] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/978634 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [16:53:07] !log sudo confctl --object-type dnsbox select 'dc=.*' set/pooled=yes T347054 [16:53:10] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:14] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [16:54:18] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:54:40] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:54:45] (03CR) 10Bking: [C: 03+2] flink-zk: Activate codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/978634 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [16:55:54] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:57:27] (03PS1) 10FNegri: [openstack] Upgrade all remaining hosts to Antelope [puppet] - 10https://gerrit.wikimedia.org/r/978636 (https://phabricator.wikimedia.org/T348843) [16:57:35] (03PS1) 10Elukey: istio: upgrade Docker images to 
1.15.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) [16:58:51] (03CR) 10CDanis: [C: 03+1] "+1 for aux" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [17:06:58] (03CR) 10Klausman: [C: 03+1] istio: upgrade Docker images to 1.15.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [17:09:00] (03PS1) 10Bking: flink-zk: Add codfw flink-zk cluster info [puppet] - 10https://gerrit.wikimedia.org/r/978639 (https://phabricator.wikimedia.org/T349095) [17:10:56] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host schema2004.codfw.wmnet with OS bookworm [17:12:56] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/978639 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:14:00] !log disable puppet on A:dns-rec to roll out CR 975843: T347054 [17:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:17] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [17:15:39] (03CR) 10Ssingh: [C: 03+2] P:dns::auth::update: add support for setting ferm rules via confd [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:16:31] (03CR) 10Bking: [C: 03+2] flink-zk: Add codfw flink-zk cluster info [puppet] - 10https://gerrit.wikimedia.org/r/978639 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:17:15] !log depooling cp2029 for some manual testing [17:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logging-hd2003.codfw.wmnet with OS bullseye [17:23:02] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logging-hd2003.codfw.wmnet with OS bullseye executed wi... [17:23:31] (03PS1) 10Elukey: cert-manager: bump appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) [17:26:54] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on schema2004.codfw.wmnet with reason: host reimage [17:32:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on schema2004.codfw.wmnet with reason: host reimage [17:35:05] !log re-enable Puppet on A:dns-rec [17:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:21] PROBLEM - Host flink-zk2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:43] !log A:dns-rec: force run-puppet-agent [17:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:55] RECOVERY - Host flink-zk2001 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [17:36:23] PROBLEM - Check systemd state on cp2029 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:36:49] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp2029 is CRITICAL: SSL CRITICAL - failed to 
connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:37:27] RECOVERY - Check systemd state on cp2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:49] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2029 is OK: SSL OK - OCSP staple validity for wikipedia.org has 220930 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-01-19 05:54:59 +0000 (expires in 50 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:37:55] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp2029 is OK: SSL OK - OCSP staple validity for wikipedia.org has 181324 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-01-19 05:55:13 +0000 (expires in 50 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:38:13] PROBLEM - Host flink-zk2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:39] RECOVERY - Host flink-zk2003 is UP: PING OK - Packet loss = 0%, RTA = 72.94 ms [17:40:14] !log running dummy authdns-update [17:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:23] (03CR) 10Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [17:40:26] (03PS2) 10Jforrester: wikifunctions: Switch Python evaluator to 2023-11-22-195017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976847 (https://phabricator.wikimedia.org/T281500) [17:41:54] !log [finished] running dummy authdns-update, all 14 hosts affected [17:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:35] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [17:43:48] !log bking@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/rdf-streaming-updater: apply [17:46:38] (03PS3) 10Ssingh: P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) [17:47:07] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [17:47:11] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [17:47:17] (03CR) 10Ssingh: "Reverting to PS1 as we decided against the check script for now that specified the min threshold." [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:47:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:48:27] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:49:05] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:49:37] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:50:08] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:50:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host schema2004.codfw.wmnet with OS bookworm [17:51:02] (03Merged) 10jenkins-bot: rdf-streaming-updater: update values for application 
mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:51:49] PROBLEM - Check systemd state on cp2029 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2029 is OK: SSL OK - OCSP staple validity for wikipedia.org has 220067 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-01-19 05:54:59 +0000 (expires in 50 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:52:16] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: update values for application mode (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [17:52:19] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp2029 is OK: SSL OK - OCSP staple validity for wikipedia.org has 180460 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-01-19 05:55:13 +0000 (expires in 50 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:52:57] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp2029 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 227222 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2024-01-24 11:51:04 +0000 (expires in 55 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:52:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:52:58] 
(RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:52:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:53:09] RECOVERY - Check systemd state on cp2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:19] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [17:56:03] (03CR) 10BBlack: [C: 03+1] P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:57:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:58:19] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (2) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [17:59:25] ^^ flink error is expected [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1800) [18:02:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (3) wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:02:59] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (16) wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:04:16] (03PS2) 10Kimberly Sarabia: Define the corresponding stream for scroll 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) [18:09:58] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [18:10:02] (03PS3) 10Jcrespo: Migrate TLS configuration to separate file and prepare for puppet call [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978133 (https://phabricator.wikimedia.org/T327157) [18:31:46] !log disable puppet on A:dns-rec to roll out CR 976254: T347054 [18:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:03] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [18:32:36] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:34:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2003.codfw.wmnet with OS bullseye [18:34:55] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logging-hd2003.codfw.wmnet with OS bullseye [18:35:13] (03PS1) 10Jcrespo: Prepare for 0.2.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978643 (https://phabricator.wikimedia.org/T327157) [18:35:58] (03PS2) 10Jcrespo: Prepare for 0.2.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978643 (https://phabricator.wikimedia.org/T327157) [18:45:03] 10SRE, 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352027 
(10phaultfinder) [18:51:18] (03PS1) 10Cory Massaro: wikifunctions: Switch orchestrator to 2023-11-29-152839 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978671 (https://phabricator.wikimedia.org/T327275) [18:52:47] (03PS2) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-29-152839 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978671 (https://phabricator.wikimedia.org/T327275) (owner: 10Cory Massaro) [18:53:02] (03PS3) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-29-152839 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978671 (https://phabricator.wikimedia.org/T327275) (owner: 10Cory Massaro) [18:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:54:48] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch orchestrator to 2023-11-29-152839 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978671 (https://phabricator.wikimedia.org/T327275) (owner: 10Cory Massaro) [18:55:39] (03Merged) 10jenkins-bot: wikifunctions: Switch orchestrator to 2023-11-29-152839 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978671 (https://phabricator.wikimedia.org/T327275) (owner: 10Cory Massaro) [18:56:44] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:57:26] !log jforrester@deploy2002 
helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:58:09] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:58:24] !log re-enable Puppet on A:dns-rec [18:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:58:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:59:09] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:59:11] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:00:05] hashar and jeena: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1900). [19:00:05] hashar and jeena: Your horoscope predicts another unfortunate MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T1900). 
[19:00:25] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [19:03:20] (03PS3) 10Jforrester: wikifunctions: Switch Python evaluator to 2023-11-29-143341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976847 (https://phabricator.wikimedia.org/T281500) [19:04:04] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Switch Python evaluator to 2023-11-29-143341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976847 (https://phabricator.wikimedia.org/T281500) (owner: 10Jforrester) [19:04:24] (03PS3) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) [19:04:32] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: fix check for attached/detached IPs [puppet] - 10https://gerrit.wikimedia.org/r/978676 [19:04:56] (03Merged) 10jenkins-bot: wikifunctions: Switch Python evaluator to 2023-11-29-143341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976847 (https://phabricator.wikimedia.org/T281500) (owner: 10Jforrester) [19:04:58] (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:07:14] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:11:04] (03PS4) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) [19:12:16] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/775/con" [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:12:49] 
(03PS1) 10Jforrester: Revert "wikifunctions: Switch Python evaluator to 2023-11-29-143341" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978512 [19:12:54] (03CR) 10Jforrester: [C: 03+2] Revert "wikifunctions: Switch Python evaluator to 2023-11-29-143341" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978512 (owner: 10Jforrester) [19:13:47] (03Merged) 10jenkins-bot: Revert "wikifunctions: Switch Python evaluator to 2023-11-29-143341" [deployment-charts] - 10https://gerrit.wikimedia.org/r/978512 (owner: 10Jforrester) [19:17:25] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:18:11] (03PS1) 10Jforrester: wikifunctions: Switch JavaScript evaluator to 2023-11-29-143341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978678 (https://phabricator.wikimedia.org/T327275) [19:19:33] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Switch JavaScript evaluator to 2023-11-29-143341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978678 (https://phabricator.wikimedia.org/T327275) (owner: 10Jforrester) [19:20:17] (03PS1) 10Bartosz Dziewoński: Update CentralAuth login failures metric [puppet] - 10https://gerrit.wikimedia.org/r/978679 (https://phabricator.wikimedia.org/T351948) [19:20:25] (03Merged) 10jenkins-bot: wikifunctions: Switch JavaScript evaluator to 2023-11-29-143341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978678 (https://phabricator.wikimedia.org/T327275) (owner: 10Jforrester) [19:20:38] (03PS2) 10Bartosz Dziewoński: Update CentralAuth login failures metric [puppet] - 10https://gerrit.wikimedia.org/r/978679 (https://phabricator.wikimedia.org/T351948) [19:21:06] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:21:38] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:22:01] (03CR) 10Jforrester: "Is this the correct way to do what we want (shorten the waiting time before giving up on a 
deploy and rolling back)?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [19:22:28] (03CR) 10Bartosz Dziewoński: "(added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/350555)" [puppet] - 10https://gerrit.wikimedia.org/r/978679 (https://phabricator.wikimedia.org/T351948) (owner: 10Bartosz Dziewoński) [19:23:32] (03PS5) 10Ssingh: P:dns::auth::update: add support for generating .ssh/config via confd [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) [19:24:01] (03CR) 10BBlack: [C: 03+1] P:dns::auth::update: add support for generating .ssh/config via confd [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:24:42] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/776/con" [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:25:59] !log disable puppet on A:dns-rec to roll out CR 977101: T347054 [19:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:09] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [19:26:33] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: add support for generating .ssh/config via confd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977101 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:29:43] (03CR) 10Majavah: [C: 03+1] labs-ip-alias-dump.py: fix check for attached/detached IPs [puppet] - 10https://gerrit.wikimedia.org/r/978676 (owner: 10Andrew Bogott) [19:31:33] (03PS1) 10Ssingh: P:dns::auth::update: use correct key name ("ip" not "IP") [puppet] - 10https://gerrit.wikimedia.org/r/978680 (https://phabricator.wikimedia.org/T347054) [19:32:32] (03PS1) 
10Jdrewniak: Deploy Vector 2022 skin to next set of sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978681 (https://phabricator.wikimedia.org/T352074) [19:33:52] (03CR) 10BBlack: [C: 03+1] P:dns::auth::update: use correct key name ("ip" not "IP") [puppet] - 10https://gerrit.wikimedia.org/r/978680 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:34:39] (03CR) 10Ssingh: [C: 03+2] P:dns::auth::update: use correct key name ("ip" not "IP") [puppet] - 10https://gerrit.wikimedia.org/r/978680 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:39:04] !log running authdns-update from dns6001 [19:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:24] (03CR) 10Andrew Bogott: [C: 03+2] labs-ip-alias-dump.py: fix check for attached/detached IPs [puppet] - 10https://gerrit.wikimedia.org/r/978676 (owner: 10Andrew Bogott) [19:47:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10cmooney) @VReilly-WMF just a heads up, for kubernetes1059 I think you selected ssw1-e1-eiqad (this is the spine switch with QSFP ports), rather than lsw1-e1-eqiad (this is the LE... 
[19:47:25] (03CR) 10Dzahn: "This broke our deployment server in devtools in cloud VPS it seems:" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [19:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [19:50:39] (03CR) 10Dzahn: "I got it fixed by adding the deployment_group key to Hiera but had to find this and seems like this would break any other deployment serve" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [19:54:11] (03PS5) 10DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) [19:58:25] PROBLEM - AuthDNS-over-TLS Works on dns6001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [19:58:28] ouch [19:58:28] (03PS1) 10DDesouza: Update coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978683 (https://phabricator.wikimedia.org/T344393) [19:58:29] expected [19:59:53] RECOVERY - AuthDNS-over-TLS Works on dns6001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [20:02:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2087.codfw.wmnet with OS bookworm [20:04:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2003.codfw.wmnet with reason: host reimage [20:08:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2003.codfw.wmnet with reason: host reimage [20:14:56] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to wmf, releng, ciadmin for sandeeps - https://phabricator.wikimedia.org/T352334 (10thcipriani) 
[20:15:00] (03PS1) 10Ssingh: P:dns::auth: pass .ssh/config to authdns-update [puppet] - 10https://gerrit.wikimedia.org/r/978685 (https://phabricator.wikimedia.org/T347054) [20:15:16] (03CR) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [20:15:55] (03PS1) 10Jdlrobson: Fix incorrect client-pref-pinned classes when client pref feature is disabled [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978513 (https://phabricator.wikimedia.org/T351141) [20:17:00] (03CR) 10BBlack: [C: 03+1] P:dns::auth: pass .ssh/config to authdns-update [puppet] - 10https://gerrit.wikimedia.org/r/978685 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [20:17:57] (03CR) 10Ssingh: [C: 03+2] P:dns::auth: pass .ssh/config to authdns-update [puppet] - 10https://gerrit.wikimedia.org/r/978685 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [20:20:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2087.codfw.wmnet with reason: host reimage [20:22:17] !log dns6001: running dummy authdns-update [20:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2087.codfw.wmnet with reason: host reimage [20:24:58] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:25:27] !log sudo cumin -s1 -b60 "A:dns-rec and not P{dns6001*}" "enable-puppet 'do not enable' && run-puppet-agent": T347054 [20:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:38] !log [correction] sudo cumin -b1 -s60 "A:dns-rec and not P{dns6001*}" 
"enable-puppet 'do not enable' && run-puppet-agent": T347054 [20:25:44] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [20:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:28:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:29:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:29:57] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:30:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:31:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2003.codfw.wmnet with OS bullseye [20:31:07] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logging-hd2003.codfw.wmnet with OS bullseye completed:... 
[20:42:35] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:42:56] (03PS1) 10Jforrester: wikifunctions: Set WASM-related environmental variables for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/978691 [20:43:28] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Set WASM-related environmental variables for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/978691 (owner: 10Jforrester) [20:44:22] (03Merged) 10jenkins-bot: wikifunctions: Set WASM-related environmental variables for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/978691 (owner: 10Jforrester) [20:45:18] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:47:45] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:48:17] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:49:01] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:50:40] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:51:37] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:53:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:53:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2087.codfw.wmnet with OS bookworm [20:53:27] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:53:58] (03PS1) 10Jforrester: wikifunctions: Switch Python evaluator to 2023-11-29-143341 (try 3) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/978514 (https://phabricator.wikimedia.org/T281500) [20:54:11] (03PS2) 10Jforrester: wikifunctions: Switch Python evaluator to 2023-11-29-143341 (try 3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978514 (https://phabricator.wikimedia.org/T281500) [20:54:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bookworm [20:56:55] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch Python evaluator to 2023-11-29-143341 (try 3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978514 (https://phabricator.wikimedia.org/T281500) (owner: 10Jforrester) [20:57:49] (03Merged) 10jenkins-bot: wikifunctions: Switch Python evaluator to 2023-11-29-143341 (try 3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978514 (https://phabricator.wikimedia.org/T281500) (owner: 10Jforrester) [21:00:00] !log dummy authdns-update [21:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T2100). [21:00:05] tgr, jan_drewniak, and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:52] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:01:22] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:01:35] o/ [21:01:46] o/ [21:02:00] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:02:21] I can deploy. [21:02:29] jan_drewniak: around? 
[21:02:47] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:02:58] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:03:19] tgr: around! [21:03:48] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:04:20] (03CR) 10Gergő Tisza: [C: 03+2] Fix incorrect client-pref-pinned classes when client pref feature is disabled [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978513 (https://phabricator.wikimedia.org/T351141) (owner: 10Jdlrobson) [21:06:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978681 (https://phabricator.wikimedia.org/T352074) (owner: 10Jdrewniak) [21:06:53] (03Merged) 10jenkins-bot: Deploy Vector 2022 skin to next set of sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978681 (https://phabricator.wikimedia.org/T352074) (owner: 10Jdrewniak) [21:07:20] !log tgr@deploy2002 Started scap: Backport for [[gerrit:978681|Deploy Vector 2022 skin to next set of sister projects (T352074)]] [21:07:34] T352074: Deploy Vector 2022 skin to next set of sister projects - https://phabricator.wikimedia.org/T352074 [21:08:45] !log tgr@deploy2002 tgr and jdrewniak: Backport for [[gerrit:978681|Deploy Vector 2022 skin to next set of sister projects (T352074)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:40] jan_drewniak: ^ [21:10:07] tgr: ok checking... 
[21:10:54] * jan_drewniak tgr: yup, looks good to sync [21:11:09] !log tgr@deploy2002 tgr and jdrewniak: Continuing with sync [21:12:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [21:16:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2088.codfw.wmnet with reason: host reimage [21:17:39] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:978681|Deploy Vector 2022 skin to next set of sister projects (T352074)]] (duration: 10m 18s) [21:17:54] T352074: Deploy Vector 2022 skin to next set of sister projects - https://phabricator.wikimedia.org/T352074 [21:19:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:20:05] (03Merged) 10jenkins-bot: Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:20:27] !log tgr@deploy2002 Started scap: Backport for [[gerrit:976844|Deploy Annual Plan Core Metrics survey (T351353)]] [21:20:33] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:21:18] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:36] (03Merged) 10jenkins-bot: Fix incorrect client-pref-pinned classes when client pref feature is disabled [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978513 (https://phabricator.wikimedia.org/T351141) (owner: 10Jdlrobson) [21:21:48] !log tgr@deploy2002 tgr and dani: Backport for [[gerrit:976844|Deploy Annual Plan Core 
Metrics survey (T351353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:22:15] danisztls: ^ [21:24:04] tgr: thanks! [21:27:02] danisztls: do you want to test it on mwdebug or is it good to go? [21:27:17] tgr: good to go [21:27:23] !log tgr@deploy2002 tgr and dani: Continuing with sync [21:27:36] I assume the other patch doesn't need testing either? [21:28:04] tgr: It doesn't [21:33:23] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:33:38] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:34:15] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:976844|Deploy Annual Plan Core Metrics survey (T351353)]] (duration: 13m 47s) [21:34:20] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:35:55] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:37:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:37:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2088.codfw.wmnet with OS bookworm [21:40:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2089.codfw.wmnet with OS bookworm [21:41:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978683 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:44:42] (03PS1) 10Dzahn: phabricator: if distro newer than buster, use python3-mysqldb [puppet] - 10https://gerrit.wikimedia.org/r/978697 (https://phabricator.wikimedia.org/T334519) [21:44:53] (03PS2) 10Gergő Tisza: Update coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978683 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:45:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978683 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:45:54] (03Merged) 10jenkins-bot: Update coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978683 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:49:49] !log tgr@deploy2002 Backport cancelled. 
[21:50:44] !log tgr@deploy2002 Started scap: Backport for [[gerrit:978683|Update coverage of Reader Demographics 2 surveys (T344393)]], [[gerrit:978513|Fix incorrect client-pref-pinned classes when client pref feature is disabled (T351141 T352257)]] [21:50:52] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:50:52] T351141: Make the client preferences controls pinnable - https://phabricator.wikimedia.org/T351141 [21:50:52] T352257: [subtask] Vector 2022 "Tools" menu collapsing does not make main content take that space anymore - https://phabricator.wikimedia.org/T352257 [21:52:07] !log tgr@deploy2002 dani and tgr and jdlrobson: Backport for [[gerrit:978683|Update coverage of Reader Demographics 2 surveys (T344393)]], [[gerrit:978513|Fix incorrect client-pref-pinned classes when client pref feature is disabled (T351141 T352257)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:53:03] jan_drewniak: ^ [21:53:56] tgr: looks good! 
[21:55:08] !log tgr@deploy2002 dani and tgr and jdlrobson: Continuing with sync [21:58:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2089.codfw.wmnet with reason: host reimage [22:00:06] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231129T2200) [22:01:20] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:978683|Update coverage of Reader Demographics 2 surveys (T344393)]], [[gerrit:978513|Fix incorrect client-pref-pinned classes when client pref feature is disabled (T351141 T352257)]] (duration: 10m 35s) [22:01:32] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [22:01:32] T351141: Make the client preferences controls pinnable - https://phabricator.wikimedia.org/T351141 [22:01:32] T352257: [subtask] Vector 2022 "Tools" menu collapsing does not make main content take that space anymore - https://phabricator.wikimedia.org/T352257 [22:01:49] jan_drewniak: danisztls: all live [22:02:20] (03CR) 10JHathaway: P:base::production: update hiera preference public vs private (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [22:02:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2089.codfw.wmnet with reason: host reimage [22:03:53] (03CR) 10JHathaway: wmflib: add new functions to update a hash with randome secrets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841479 (owner: 10Jbond) [22:04:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977791 (owner: 10Gergő Tisza) [22:06:02] (03PS2) 10Gergő Tisza: mobile: Remove $wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977791 [22:06:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/977791 (owner: 10Gergő Tisza) [22:06:56] (03Merged) 10jenkins-bot: mobile: Remove $wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977791 (owner: 10Gergő Tisza) [22:07:13] (03CR) 10Ryan Kemper: [C: 03+1] query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [22:07:17] !log tgr@deploy2002 Started scap: Backport for [[gerrit:977791|mobile: Remove $wgMobileUrlTemplate]] [22:07:25] (03CR) 10Bking: [C: 03+2] query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [22:08:38] !log tgr@deploy2002 tgr: Backport for [[gerrit:977791|mobile: Remove $wgMobileUrlTemplate]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:10:20] !log bking@cumin2002 running puppet against cp hosts to apply 978134 [22:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:01] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [22:19:34] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:19:42] !log tgr@deploy2002 tgr: Continuing with sync [22:21:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:21:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2089.codfw.wmnet with OS bookworm [22:22:53] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.reimage for host elastic2090.codfw.wmnet with OS bookworm [22:23:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [22:23:46] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Papaul) [22:24:23] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Papaul) 05Open→03Resolved @colewhite all your's [22:24:53] (03CR) 10Clare Ming: "per discussion with @Kimberly_Sarabia, new config should add a new stream for `web_ui_scroll` (not replace current one) that points to MP " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [22:25:26] (03PS1) 10Bking: miscweb: change wdqs ldf endpoint blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/978700 (https://phabricator.wikimedia.org/T347355) [22:27:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978700 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:28:10] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:977791|mobile: Remove $wgMobileUrlTemplate]] (duration: 20m 53s) [22:28:53] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/777/console" [puppet] - 10https://gerrit.wikimedia.org/r/978700 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:31:30] (03CR) 10Bking: [C: 03+2] miscweb: change wdqs ldf endpoint blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/978700 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:31:33] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+1] miscweb: change wdqs ldf endpoint blackbox check [puppet] - 
10https://gerrit.wikimedia.org/r/978700 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:40:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2090.codfw.wmnet with reason: host reimage [22:44:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2090.codfw.wmnet with reason: host reimage [22:56:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2091.codfw.wmnet with OS bookworm [22:56:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2091.codfw.wmnet with OS bookworm [22:56:57] (03PS3) 10Kimberly Sarabia: Define the corresponding stream for scroll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) [23:01:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:02:15] (03CR) 10Krinkle: [C: 03+1] "I'm waiting for the to deploy before merging the core patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) (owner: 10Ladsgroup) [23:03:38] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/978697/778/" [puppet] - 10https://gerrit.wikimedia.org/r/978697 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [23:04:11] jouncebot: nowandnext [23:04:11] No deployments scheduled for the next 7 hour(s) and 55 minute(s) [23:04:11] In 7 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0700) [23:04:12] In 7 hour(s) and 55 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0700) [23:04:20] (03PS2) 10Ladsgroup: Add virtual domain for botpasswords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) [23:04:33] (03CR) 10Ladsgroup: [C: 03+2] "Going up!!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) (owner: 10Ladsgroup) [23:04:58] (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:05:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:05:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2090.codfw.wmnet with OS bookworm [23:05:22] (03Merged) 10jenkins-bot: Add virtual domain for botpasswords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) (owner: 10Ladsgroup) [23:05:49] (03PS1) 10Papaul: Add new ganeti nodes to 
site.pp [puppet] - 10https://gerrit.wikimedia.org/r/978707 (https://phabricator.wikimedia.org/T349926) [23:06:07] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:976787|Add virtual domain for botpasswords (T351559)]] [23:06:18] (03CR) 10CI reject: [V: 04-1] Add new ganeti nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/978707 (https://phabricator.wikimedia.org/T349926) (owner: 10Papaul) [23:06:23] T351559: Migrate bot passwords to use a virtual database domain - https://phabricator.wikimedia.org/T351559 [23:07:10] (03CR) 10Dzahn: "mixed brackets [ vs )" [puppet] - 10https://gerrit.wikimedia.org/r/978707 (https://phabricator.wikimedia.org/T349926) (owner: 10Papaul) [23:07:29] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:976787|Add virtual domain for botpasswords (T351559)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:08:19] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop in production, package installed on cloud test instance" [puppet] - 10https://gerrit.wikimedia.org/r/978697 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [23:08:53] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [23:08:54] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:10:13] (03PS2) 10Papaul: Add new ganeti nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/978707 (https://phabricator.wikimedia.org/T349926) [23:10:30] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:12:02] (03CR) 10Papaul: [C: 03+2] Add new ganeti nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/978707 (https://phabricator.wikimedia.org/T349926) (owner: 10Papaul) [23:15:36] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:976787|Add virtual 
domain for botpasswords (T351559)]] (duration: 09m 28s) [23:15:41] T351559: Migrate bot passwords to use a virtual database domain - https://phabricator.wikimedia.org/T351559 [23:16:10] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:16:50] (03PS1) 10Dzahn: phabricator: turn deploy script into template, support for php7.4-fpm [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) [23:19:59] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/output/978710/779/phab1004.eqiad.wmnet/change.phab1004.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [23:20:49] (03PS2) 10Dzahn: phabricator: turn deploy script into template, support for php7.4-fpm [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) [23:22:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2033.codfw.wmnet with OS bullseye [23:23:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2033.codfw.wmnet with OS bullseye [23:24:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2091.codfw.wmnet with reason: host reimage [23:28:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2091.codfw.wmnet with reason: host reimage [23:31:02] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/978710/780/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [23:32:57] 
10ops-eqiad, 10Cassandra, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sde) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T352344 (10Eevans) [23:34:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [23:40:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2034.codfw.wmnet with OS bullseye [23:40:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti2034.codfw.wmnet with OS bullseye [23:41:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10Papaul) [23:43:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2034.codfw.wmnet with OS bullseye [23:43:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2034.codfw.wmnet with OS bullseye [23:44:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:45:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:45:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2033.codfw.wmnet with reason: host reimage [23:45:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2091.codfw.wmnet with OS bookworm [23:45:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by pt1979@cumin2002 for host elastic2091.codfw.wmnet with OS bookworm completed: - elastic2091 (**PASS**)... [23:46:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [23:47:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) 05Open→03Resolved @bking all your's [23:49:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2033.codfw.wmnet with reason: host reimage [23:54:31] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable