[00:01:56] (PS1) Papaul: The installer on new nodes readingreuse-parts.cfg first making it to fail [puppet] - https://gerrit.wikimedia.org/r/977804 (https://phabricator.wikimedia.org/T349758)
[00:03:18] (CR) Papaul: [C: +2] The installer on new nodes readingreuse-parts.cfg first making it to fail [puppet] - https://gerrit.wikimedia.org/r/977804 (https://phabricator.wikimedia.org/T349758) (owner: Papaul)
[00:15:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2028.codfw.wmnet with reason: host reimage
[00:19:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2028.codfw.wmnet with reason: host reimage
[00:23:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:23:27] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:33:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2029.codfw.wmnet with OS bullseye
[00:33:38] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye
[00:36:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/977712
[00:38:33] (CR)
TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/977712 (owner: TrainBranchBot)
[00:38:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:38:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2028.codfw.wmnet with OS bullseye
[00:38:48] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2028.codfw.wmnet with OS bullseye completed: - restbase2028 (**WARN*...
[00:45:24] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:14] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul)
[00:51:01] (CR) Andrew Bogott: [C: +2] vendordata: pin puppet packages to wikimedia repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/927664 (https://phabricator.wikimedia.org/T338195) (owner: Andrew Bogott)
[00:51:53] (CR) Andrew Bogott: [C: +2] cloudlb2001: use new cloud-private vlan addresses for designate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/920795 (https://phabricator.wikimedia.org/T336808) (owner: Andrew Bogott)
[00:52:31] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2029.codfw.wmnet with OS bullseye
[00:52:36] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye executed with errors: - restbase20...
[00:52:58] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) @Jhancock.wm this is ready now for OS install. I did a test on restbase2028,PASS. on 2029 it looks like the network cable is not plug into NIC 1 since the MAC add...
[00:56:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED
[00:56:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2097.mgmt.codfw.wmnet with reboot policy FORCED
[00:56:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2101.mgmt.codfw.wmnet with reboot policy FORCED
[00:56:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2034.mgmt.codfw.wmnet with reboot policy FORCED
[00:57:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/977712 (owner: TrainBranchBot)
[00:57:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED
[00:58:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED
[01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352124 (phaultfinder)
[01:04:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2097.mgmt.codfw.wmnet with reboot policy FORCED
[01:05:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2097']
[01:06:13] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts
['elastic2097']
[01:06:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2101.mgmt.codfw.wmnet with reboot policy FORCED
[01:06:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2097']
[01:07:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2101']
[01:07:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2101']
[01:07:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2101']
[01:07:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2034.mgmt.codfw.wmnet with reboot policy FORCED
[01:08:42] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2034']
[01:09:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ganeti2034']
[01:09:18] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2034']
[01:09:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED
[01:11:38] (PS9) Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615)
[01:11:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2091']
[01:12:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094']
[01:12:13] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2100']
[01:12:22] !log jhancock@cumin2002 END
(PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2091']
[01:12:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2100']
[01:12:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2097']
[01:13:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2101']
[01:13:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2091.mgmt.codfw.wmnet with reboot policy FORCED
[01:13:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2100.mgmt.codfw.wmnet with reboot policy FORCED
[01:14:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti2034']
[01:16:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2091.mgmt.codfw.wmnet with reboot policy FORCED
[01:16:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2100.mgmt.codfw.wmnet with reboot policy FORCED
[01:17:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2091.mgmt.codfw.wmnet with reboot policy FORCED
[01:17:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2101.mgmt.codfw.wmnet with reboot policy FORCED
[01:17:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2101.mgmt.codfw.wmnet with reboot policy FORCED
[01:17:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2094']
[01:18:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2100.mgmt.codfw.wmnet with reboot policy FORCED
[01:20:49] SRE,
ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Jhancock.wm)
[01:24:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2091.mgmt.codfw.wmnet with reboot policy FORCED
[01:24:49] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2091']
[01:26:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2100.mgmt.codfw.wmnet with reboot policy FORCED
[01:27:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2100']
[01:30:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2091']
[01:32:28] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (Jhancock.wm)
[01:34:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2100']
[01:40:54] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[01:42:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:42:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED
[01:46:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED
[01:50:57] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Jhancock.wm)
[01:53:14] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Jhancock.wm) server elastic2096 is having an issue with the provisioning script.
I did check the cable and tried redoing the netbox script. mgmt ip is still unpingable. g...
[02:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[02:38:27] (JobUnavailable) firing: (4) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:43:04] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:07] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (VRiley-WMF) kubernetes1059 Rack: E 1 Position: U42 CableID: 5-02562 Port: 41 kubernetes1060 Rack: E 2 Position: U 43 CableID: 5-02561 Port: 44 kubernetes1061 Rack: E 3 Positio...
[02:47:26] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:44] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (VRiley-WMF)
[02:59:17] (CR) KartikMistry: [C: +2] Update cxserver to 2023-10-12-080927-production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: KartikMistry)
[02:59:22] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (VRiley-WMF) elastic1103 Rack: D 4 Position: U 17 CableID: 230304500226 Port 46 elastic1104 Rack: E 1 Position: U 14 CableID: 20220222 Port: 2 elastic1105 Rack: E 5 Posit...
[02:59:30] (CR) KartikMistry: Use Parsoid for all Wikis for Content Translation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: KartikMistry)
[02:59:53] (PS3) KartikMistry: Update Apertium to 2023-11-23-055425-production [deployment-charts] - https://gerrit.wikimedia.org/r/977183 (https://phabricator.wikimedia.org/T346997)
[03:00:08] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0300)
[03:00:22] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (VRiley-WMF)
[03:07:12] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.7 [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977713 (https://phabricator.wikimedia.org/T350083)
[03:07:16] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.7 [core]
(wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977713 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[03:08:27] (JobUnavailable) firing: (4) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:19:27] (PS11) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[03:19:29] (PS8) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[03:24:06] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.7 [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977713 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[03:55:32] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7, AS6939/IPv4: Connect - HE, AS13030/IPv4: Connect - Init7, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0400)
[04:01:27] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/977817 (https://phabricator.wikimedia.org/T350083)
[04:01:29] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.7
[mediawiki-config] - https://gerrit.wikimedia.org/r/977817 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[04:02:13] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/977817 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[04:02:42] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.7 refs T350083
[04:02:47] T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083
[04:12:44] (PS1) Subramanya Sastry: Revert "Parsoid DataAccess: Stop processing extensions as top-level docs" [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977625
[04:13:24] (CR) Subramanya Sastry: [C: +1] "This should be merged before rolling out the train." [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977625 (owner: Subramanya Sastry)
[04:23:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:53:54] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.7 refs T350083 (duration: 51m 11s)
[04:53:59] T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083
[04:56:10] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.4 (duration: 02m 14s)
[04:58:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:30] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:10] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold
[400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[06:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:13:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:26] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:25:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[06:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0700)
[07:00:06] kormat, marostegui, and Amir1: That opportune
time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0700).
[07:08:27] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:24:27] (PS1) KartikMistry: Update cxserver to 2023-11-28-064518-production [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[07:32:40] (CR) Ayounsi: P:netbase: parse the service catalogue and inject the service ports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673105 (owner: Jbond)
[07:33:02] (PS1) Slyngshede: P:IDM Send creation email from no-reply. [puppet] - https://gerrit.wikimedia.org/r/977984
[07:39:31] !log add RPKI ROA for 193.46.90.0/24 - T309297
[07:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:42] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:46:10] (CR) Ayounsi: [C: +1] "Thanks :)" [puppet] - https://gerrit.wikimedia.org/r/977984 (owner: Slyngshede)
[07:48:48] (PS1) Santiago Faci: Remove partial migration of VisualEditorFeatureUse instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337)
[08:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:21] (PS1) Muehlenhoff: Extend access for hghani [puppet] - https://gerrit.wikimedia.org/r/977987
[08:02:33] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/977984 (owner: Slyngshede)
[08:03:26] (CR) Muehlenhoff: [C: +2] Extend access for hghani [puppet] - https://gerrit.wikimedia.org/r/977987 (owner: Muehlenhoff)
[08:07:04] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:13:46] (CR) Muehlenhoff: [C: +2] Use separate /etc/ganeti/ssl directory if using PKI [puppet] - https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[08:15:56] (PS1) Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/977714 (https://phabricator.wikimedia.org/T343123)
[08:16:00] (CR) Slyngshede: [C: +2] P:IDM Send creation email from no-reply.
[puppet] - https://gerrit.wikimedia.org/r/977984 (owner: Slyngshede)
[08:19:41] good morning
[08:19:46] jouncebot: now
[08:19:46] For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0800)
[08:19:49] cool
[08:19:53] (CR) Hashar: [C: +2] Revert "Parsoid DataAccess: Stop processing extensions as top-level docs" [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977625 (owner: Subramanya Sastry)
[08:20:35] (PS1) Marostegui: db1119: Move it to m3 [puppet] - https://gerrit.wikimedia.org/r/977991 (https://phabricator.wikimedia.org/T351990)
[08:22:33] dbproxy irc alerts are to be expected
[08:23:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:24:06] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:56] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:25:06] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:25:43] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:26:14] (CR) Marostegui: [C: +2] db1119: Move it to m3 [puppet] - https://gerrit.wikimedia.org/r/977991 (https://phabricator.wikimedia.org/T351990) (owner: Marostegui)
[08:28:47] (PS1) Muehlenhoff: ganeti: Drop explicit require in pki
query [puppet] - https://gerrit.wikimedia.org/r/977992
[08:30:24] (PS7) Majavah: cloudnfs: refactor configuration [puppet] - https://gerrit.wikimedia.org/r/931584
[08:30:43] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:07] (CR) Muehlenhoff: [C: +2] ganeti: Drop explicit require in pki query [puppet] - https://gerrit.wikimedia.org/r/977992 (owner: Muehlenhoff)
[08:31:30] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:15] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - https://gerrit.wikimedia.org/r/931584 (owner: Majavah)
[08:34:28] (CR) Majavah: [V: +1] cloudnfs: refactor configuration (2 comments) [puppet] - https://gerrit.wikimedia.org/r/931584 (owner: Majavah)
[08:34:34] (CR) Majavah: [V: +1 C: +2] cloudnfs: refactor configuration [puppet] - https://gerrit.wikimedia.org/r/931584 (owner: Majavah)
[08:34:54] RECOVERY - HTTPS Ganeti RAPI codfw on ganeti-test2003 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.013 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[08:36:00] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:06] (CR) Giuseppe Lavagetto: Expose Netbox's BGP servers to Homer (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649)
(owner: Ayounsi)
[08:36:16] (PS1) Marostegui: dbproxy1024: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/977993 (https://phabricator.wikimedia.org/T351864)
[08:37:07] SRE, Traffic: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers - https://phabricator.wikimedia.org/T352143 (Vgutierrez)
[08:37:25] (CR) Marostegui: [C: +2] dbproxy1024: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/977993 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[08:37:26] SRE, Traffic: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers - https://phabricator.wikimedia.org/T352143 (Vgutierrez) p:Triage→High
[08:37:26] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:27] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:37:28] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:37:36] (Merged) jenkins-bot: Revert "Parsoid DataAccess: Stop processing extensions as top-level docs" [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/977625 (owner: Subramanya Sastry)
[08:37:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1024.eqiad.wmnet with OS bookworm
[08:39:36] !log hashar@deploy2002 Started scap: Backport for [[gerrit:977625|Revert "Parsoid DataAccess: Stop processing extensions as top-level docs"]]
[08:41:03] !log hashar@deploy2002 hashar and ssastry: Backport for [[gerrit:977625|Revert "Parsoid DataAccess: Stop processing extensions as top-level docs"]] synced to the
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:41:08] !log hashar@deploy2002 hashar and ssastry: Continuing with sync
[08:42:04] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:44:06] (PS1) Brouberol: Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722)
[08:45:29] SRE, Traffic: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers - https://phabricator.wikimedia.org/T352143 (Vgutierrez) using the syntax on the good old iptables, this should work: ` iptables -A INPUT -s 172.16.0.0/10 -p ipencap -j ACCEPT ip6tables -A INPUT -s 0100::/64 -...
[08:47:31] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:977625|Revert "Parsoid DataAccess: Stop processing extensions as top-level docs"]] (duration: 07m 54s)
[08:48:20] (CR) Volans: puppet-merge: add prometheus metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: Jbond)
[08:50:30] RECOVERY - HTTPS Ganeti RAPI eqiad on ganeti-test1001 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.009 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[08:50:50] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:34] (PS1) Vgutierrez: ncredir: Allow IPIP/IP6IP6 inbound traffic [puppet] - https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143)
[08:52:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1024.eqiad.wmnet with reason: host reimage
[08:52:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:46] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:00] (03CR) 10Volans: [C: 04-1] "I think there is a small issue with the binary name." [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [08:55:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1024.eqiad.wmnet with reason: host reimage [08:56:42] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:56:52] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:59:19] (03PS2) 10Vgutierrez: ncredir: Allow IPIP/IP6IP6 inbound traffic [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) [09:00:05] hashar and jeena: That opportune time is upon us again. Time for a MediaWiki train - Utc-0+Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T0900). 
[09:00:37] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/730/con" [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) (owner: 10Vgutierrez) [09:03:12] (03PS3) 10Vgutierrez: ncredir: Allow IPIP/IP6IP6 inbound traffic [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) [09:03:43] (03CR) 10CI reject: [V: 04-1] ncredir: Allow IPIP/IP6IP6 inbound traffic [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) (owner: 10Vgutierrez) [09:04:49] (03PS4) 10Vgutierrez: ncredir: Allow IPIP/IP6IP6 inbound traffic [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) [09:06:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/732/con" [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) (owner: 10Vgutierrez) [09:07:08] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1024.eqiad.wmnet with OS bookworm [09:26:20] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Volans) We had only a couple of changes in the service.yaml schema in the last months and both were sent to Spicerack before hitting product... 
[09:28:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:19] I am doing the train [09:30:19] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: adapt conftool module for etcd v3 - https://phabricator.wikimedia.org/T352153 (10Volans) [09:31:19] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978001 (https://phabricator.wikimedia.org/T350083) [09:31:21] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978001 (https://phabricator.wikimedia.org/T350083) (owner: 10TrainBranchBot) [09:31:47] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: migrate distributed locking to etcd v3 - https://phabricator.wikimedia.org/T352155 (10Volans) [09:32:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:45] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978001 (https://phabricator.wikimedia.org/T350083) (owner: 10TrainBranchBot) [09:40:16] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.7 refs T350083 [09:40:22] T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083 [09:40:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) (owner: 10Vgutierrez) [09:41:16] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (10Volans) 05Open→03Declined As there is already a workaround to do that in 
the cookbooks on demand and it will be even simpler wi... [09:44:19] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] ncredir: Allow IPIP/IP6IP6 inbound traffic [puppet] - 10https://gerrit.wikimedia.org/r/977997 (https://phabricator.wikimedia.org/T352143) (owner: 10Vgutierrez) [09:46:55] oh joy stuff got broken :) [09:49:14] over two weeks? shocking [09:50:04] (03PS1) 10Vgutierrez: Revert "service: Disable IPIP encapsulation for ncredir@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/978008 [09:50:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [09:52:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/733/con" [puppet] - 10https://gerrit.wikimedia.org/r/978008 (owner: 10Vgutierrez) [09:52:26] so here is the story [09:53:02] since yesterday 8:24:26, MediaWiki on kubernetes complains about: `[{reqId}] {exception_url} PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat.` [09:53:10] https://phabricator.wikimedia.org/T352156 [09:53:39] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Volans) p:05Triage→03Low As the main blocker was resolved giving more permissions to the bot in T314917, setting the priority lower fo... [09:54:00] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] Revert "service: Disable IPIP encapsulation for ncredir@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/978008 (owner: 10Vgutierrez) [09:54:12] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10JMeybohm) Sorry, I must have missed the message. Yes, IIRC that is the correct interpretation. 
[09:56:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [09:58:08] !log installing intel-microcode security updates [09:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:50] !log sg912@deploy2002 Started deploy [airflow-dags/analytics@0283c11]: (no justification provided) [10:01:37] !log sg912@deploy2002 Finished deploy [airflow-dags/analytics@0283c11]: (no justification provided) (duration: 00m 47s) [10:04:02] (03CR) 10FNegri: [C: 03+1] "Sounds reasonable!" [puppet] - 10https://gerrit.wikimedia.org/r/977741 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [10:04:30] (03CR) 10Majavah: [C: 03+2] P:alertmanager: wmcs: do not group by instance [puppet] - 10https://gerrit.wikimedia.org/r/977741 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [10:04:39] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [10:05:07] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10Volans) p:05Triage→03Medium Perfect, thanks for the update. 
[10:08:27] (03PS2) 10Jcrespo: Add tmpdir removal, now that upload is stable [software/mediabackups] - 10https://gerrit.wikimedia.org/r/922830 (https://phabricator.wikimedia.org/T327157) [10:08:29] (03PS2) 10Jcrespo: Increase unit test coverage for File, MySQLMedia and MySQLMetadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/922907 (https://phabricator.wikimedia.org/T327157) [10:09:24] !log rolling restart of pybal on lvs4010 and lvs4008, effectively enabling IPIP encapsulation on ncredir@ulsfo - T351069 [10:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:32] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [10:11:07] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (10Volans) As all the above cookbooks are already listed in T317855 I'm resolving this as duplicate. 
[10:11:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (10Volans) [10:11:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (10Volans) [10:12:11] (03CR) 10Jcrespo: [C: 03+2] Add tmpdir removal, now that upload is stable [software/mediabackups] - 10https://gerrit.wikimedia.org/r/922830 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [10:12:45] (03CR) 10Jcrespo: [C: 03+2] Increase unit test coverage for File, MySQLMedia and MySQLMetadata [software/mediabackups] - 10https://gerrit.wikimedia.org/r/922907 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [10:13:07] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:11] !incidents [10:13:12] 4284 (UNACKED) ProbeDown sre (198.35.26.98 ip4 ncredir-https:443 probes/service http_ncredir-https_ip4 ulsfo) [10:13:12] 4283 (RESOLVED) HaproxyUnavailable cache_text global sre () [10:13:12] 4282 (RESOLVED) [2x] ProbeDown sre (ncredir-https:443 probes/service ulsfo) [10:13:20] !ack 4284 [10:13:20] 4284 (ACKED) ProbeDown sre (198.35.26.98 ip4 ncredir-https:443 probes/service http_ncredir-https_ip4 ulsfo) [10:13:28] here [10:13:37] vgutierrez: is that you? [10:13:42] yes [10:13:57] anything I can do to help? 
[10:14:25] اثقث [10:14:28] re [10:14:30] here [10:14:34] ugh [10:14:54] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:15:22] Amir1: is related to the ip-ip encapsulation tests for ncredir [10:15:47] in ulsfo [10:15:51] yeah, I'm around if needed :P [10:16:20] me too, vgutierrez anything we can do to help? [10:16:37] nope, I'm debugging what's going on, thx [10:16:52] ack, ping/page if you need us [10:17:35] (03PS1) 10Vgutierrez: Revert "Revert "service: Disable IPIP encapsulation for ncredir@ulsfo"" [puppet] - 10https://gerrit.wikimedia.org/r/978009 [10:18:53] (03CR) 10Vgutierrez: [C: 03+2] Revert "Revert "service: Disable IPIP encapsulation for ncredir@ulsfo"" [puppet] - 10https://gerrit.wikimedia.org/r/978009 (owner: 10Vgutierrez) [10:21:19] !log rolling restart of pybal on lvs4010 and lvs4008, effectively disabling IPIP encapsulation on ncredir@ulsfo - T351069 [10:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:24] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [10:22:44] so besides GeoIP.dat missing since yesterday there is not much going on. I have a few errands to do here at home, I will be back after lunch. 
[10:22:53] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [10:23:07] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:23:59] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:18] \o/ [10:35:37] !log depool ncredir4001 [10:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:17] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/977714 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [10:37:07] (03Merged) 10jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/977714 (https://phabricator.wikimedia.org/T343123) (owner: 10Kevin Bazira) [10:37:37] !log installing lua5.3 security updates [10:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:45] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:40:36] 10SRE, 10Traffic: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers - https://phabricator.wikimedia.org/T352143 (10Vgutierrez) 05Open→03Resolved [10:40:41] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [10:41:20] (03CR) 10Hnowlan: [C: 03+1] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [10:41:28] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [10:42:46] (03PS1) 10Muehlenhoff: Add library hint for lua5.3 [puppet] - 10https://gerrit.wikimedia.org/r/978026 [10:43:52] 10SRE, 10Traffic: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers - https://phabricator.wikimedia.org/T352160 (10Vgutierrez) [10:44:14] 10SRE, 10Traffic: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers - https://phabricator.wikimedia.org/T352160 (10Vgutierrez) p:05Triage→03High [10:45:01] !log repool ncredir4001 [10:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:50] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for lua5.3 [puppet] - 10https://gerrit.wikimedia.org/r/978026 (owner: 10Muehlenhoff) [10:46:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] gerrit: use prod known hosts [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/971924 (owner: 10Jbond) [10:47:57] (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: increase replicas by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/977683 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [10:47:59] (03PS2) 10Btullis: Upgrade airflow on the analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/977632 (https://phabricator.wikimedia.org/T351621) [10:48:01] (03PS2) 10Btullis: Upgrade 
airflow on the search instance [puppet] - 10https://gerrit.wikimedia.org/r/977633 (https://phabricator.wikimedia.org/T351621) [10:48:03] (03PS2) 10Btullis: Upgrade airflow on the research instance [puppet] - 10https://gerrit.wikimedia.org/r/977634 (https://phabricator.wikimedia.org/T351621) [10:48:05] (03PS2) 10Btullis: Upgrade airflow on the platform_eng instance [puppet] - 10https://gerrit.wikimedia.org/r/977635 (https://phabricator.wikimedia.org/T351621) [10:48:07] (03PS2) 10Btullis: Upgrade airflow on the analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/977636 (https://phabricator.wikimedia.org/T351621) [10:48:09] (03PS2) 10Btullis: Upgrade airflow on wmde [puppet] - 10https://gerrit.wikimedia.org/r/977637 (https://phabricator.wikimedia.org/T351621) [10:48:11] (03PS2) 10Btullis: Update default airflow_version and remove overrides [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) [10:48:13] (03PS1) 10Btullis: Update the version of airflow on analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/978028 (https://phabricator.wikimedia.org/T343232) [10:48:16] (03CR) 10Effie Mouzeli: [C: 03+2] tegola: update image to pick up OS fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/973817 (https://phabricator.wikimedia.org/T348647) (owner: 10Effie Mouzeli) [10:48:44] (03CR) 10Jbond: [C: 03+2] bacula: only export resources if we have puppetdb support [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [10:48:59] (03Merged) 10jenkins-bot: mw-api-int: increase replicas by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/977683 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [10:50:49] (03CR) 10CI reject: [V: 04-1] Update the version of airflow on analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/978028 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [10:51:16] (03Merged) 10jenkins-bot: tegola: update image to pick up OS fixes 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/973817 (https://phabricator.wikimedia.org/T348647) (owner: 10Effie Mouzeli) [10:52:06] !log depool ncredir4001 [10:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:20] (03PS1) 10Muehlenhoff: ganeti/pki: Use chained cert [puppet] - 10https://gerrit.wikimedia.org/r/978029 (https://phabricator.wikimedia.org/T350686) [10:52:52] (03CR) 10Jbond: [V: 03+1] puppetserver: create a necessary parent dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [10:53:23] (03CR) 10Btullis: Mention the fact that 'history' is a valid argument in the error message (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [10:53:59] (JobUnavailable) firing: (4) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:54:25] (03PS2) 10Btullis: Upgrade airflow on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/978028 (https://phabricator.wikimedia.org/T343232) [10:54:27] (03PS3) 10Btullis: Upgrade airflow on the analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/977632 (https://phabricator.wikimedia.org/T351621) [10:54:29] (03PS3) 10Btullis: Upgrade airflow on the search instance [puppet] - 10https://gerrit.wikimedia.org/r/977633 (https://phabricator.wikimedia.org/T351621) [10:54:31] (03PS3) 10Btullis: Upgrade airflow on the research instance [puppet] - 10https://gerrit.wikimedia.org/r/977634 (https://phabricator.wikimedia.org/T351621) [10:54:33] (03PS3) 10Btullis: Upgrade airflow on the platform_eng instance [puppet] - 10https://gerrit.wikimedia.org/r/977635 (https://phabricator.wikimedia.org/T351621) [10:54:35] (03PS3) 10Btullis: 
Upgrade airflow on the analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/977636 (https://phabricator.wikimedia.org/T351621) [10:54:37] (03PS3) 10Btullis: Upgrade airflow on wmde [puppet] - 10https://gerrit.wikimedia.org/r/977637 (https://phabricator.wikimedia.org/T351621) [10:54:39] (03PS3) 10Btullis: Update default airflow_version and remove overrides [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) [10:54:57] (03CR) 10CI reject: [V: 04-1] Upgrade airflow on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/978028 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [10:55:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978029 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [10:56:52] (03CR) 10Jbond: vendordata: pin puppet packages to wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927664 (https://phabricator.wikimedia.org/T338195) (owner: 10Andrew Bogott) [10:58:58] (03PS2) 10Brouberol: Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) [10:58:59] (JobUnavailable) firing: (5) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1100) [11:02:56] (03CR) 10Brouberol: Mention the fact that 'history' is a valid argument in the error message (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [11:04:23] (03PS8) 10Jbond: 
puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) [11:04:57] (JobUnavailable) firing: (6) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:06:07] (03CR) 10Jbond: [C: 03+2] puppet-merge: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [11:06:20] (03CR) 10Jbond: [C: 03+2] puppet-merge: Fix up help message [puppet] - 10https://gerrit.wikimedia.org/r/977185 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [11:06:35] (03CR) 10Jbond: [C: 03+2] puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [11:16:30] (03CR) 10Jon Harald Søby: [C: 04-1] zghwiki: add logos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [11:17:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/978029 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [11:18:21] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10Volans) I had a quick thought about the ENC++ problem as you have named it and I think in the end given a netbox device object (hostname + location + eventually other da... 
[11:18:56] (03CR) 10Muehlenhoff: [C: 03+2] ganeti/pki: Use chained cert [puppet] - 10https://gerrit.wikimedia.org/r/978029 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [11:18:59] (RedisMemoryFull) firing: Redis memory full on gitlab2002:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_gitlab&var-instance=gitlab2002:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [11:21:01] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:21:21] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:21:32] (03PS1) 10Jgiannelos: Use zap for structured logs on tile pregeneration [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) [11:22:03] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:22:18] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:25:25] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 30% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976220 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:25:35] (03PS2) 10Kamila Součková: mobileapps: 30% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976220 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:26:52] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Idle - Tele2, AS1257/IPv6: Idle - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:10] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:27:15] (03PS3) 10Btullis: Upgrade airflow 
on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/978028 (https://phabricator.wikimedia.org/T343232) [11:27:17] (03PS4) 10Btullis: Upgrade airflow on the analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/977632 (https://phabricator.wikimedia.org/T351621) [11:27:19] (03PS4) 10Btullis: Upgrade airflow on the search instance [puppet] - 10https://gerrit.wikimedia.org/r/977633 (https://phabricator.wikimedia.org/T351621) [11:27:21] (03PS4) 10Btullis: Upgrade airflow on the research instance [puppet] - 10https://gerrit.wikimedia.org/r/977634 (https://phabricator.wikimedia.org/T351621) [11:27:23] (03PS4) 10Btullis: Upgrade airflow on the platform_eng instance [puppet] - 10https://gerrit.wikimedia.org/r/977635 (https://phabricator.wikimedia.org/T351621) [11:27:25] (03PS4) 10Btullis: Upgrade airflow on the analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/977636 (https://phabricator.wikimedia.org/T351621) [11:27:27] (03PS4) 10Btullis: Upgrade airflow on wmde [puppet] - 10https://gerrit.wikimedia.org/r/977637 (https://phabricator.wikimedia.org/T351621) [11:27:29] (03PS4) 10Btullis: Update default airflow_version and remove overrides [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) [11:27:57] * volans checking the bgp down [11:31:13] (03PS1) 10Hnowlan: changejob-jobqueue: move two more jobs to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/978032 (https://phabricator.wikimedia.org/T349796) [11:33:11] !log volans@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr2-esams:xe-0/1/2 [11:33:20] !log volans@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr2-esams:xe-0/1/2 [11:35:19] (03CR) 10Jbond: [V: 03+1 C: 04-1] "The change in pcc was unexpected as the puppetmasteres should not use the server certs. 
thie diff reminded me of the following issues" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [11:37:12] (03PS1) 10Jbond: readme: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/978033 [11:37:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] readme: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/978033 (owner: 10Jbond) [11:37:59] (03CR) 10Jon Harald Søby: [C: 04-1] zghwiki: add logos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [11:39:47] (03PS2) 10Awight: Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [11:41:03] (03PS1) 10Jbond: Revert "puppet-merge: add prometheus metrics" [puppet] - 10https://gerrit.wikimedia.org/r/978015 [11:41:07] (03PS1) 10Jbond: Revert "puppet-merge: Fix up help message" [puppet] - 10https://gerrit.wikimedia.org/r/978016 [11:41:25] (03PS2) 10Jbond: Revert "puppet-merge: add prometheus metrics" [puppet] - 10https://gerrit.wikimedia.org/r/978015 [11:41:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppet-merge: add prometheus metrics" [puppet] - 10https://gerrit.wikimedia.org/r/978015 (owner: 10Jbond) [11:41:53] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:41:57] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:42:12] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:42:16] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:44:36] (03CR) 10Awight: [C: 03+1] "Can remove the PopupsReferencePreviews config as well, see I0179e182408 ." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [11:45:18] (03PS1) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [11:45:20] (03PS5) 10Anzx: zghwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) [11:45:40] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, 10media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10MatthewVernon) I think it's fair to say 12 5GB files a month would not be overwhelming (about 2TB of raw capacity... [11:45:47] (03PS1) 10Jbond: Revert "readme: test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/978018 [11:46:02] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:46:11] (03CR) 10Majavah: [C: 04-1] puppet-merge: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [11:46:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 63, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:47:10] (03CR) 10Jbond: [C: 03+2] Revert "readme: test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/978018 (owner: 10Jbond) [11:49:30] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:50:06] !log pool ncredir4001 [11:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:38] (03PS1) 10Volans: CHANGELOG: add changelogs for 
release v1.2.4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/978037 [11:52:36] (03PS6) 10Anzx: zghwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) [11:55:27] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:55:29] (03PS1) 10Vgutierrez: lvs::realserver: Disable RP filter [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) [11:56:03] (03PS2) 10Vgutierrez: lvs::realserver::ipip: Disable RP filter [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) [11:56:15] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:56:33] (03PS7) 10Anzx: zghwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) [11:57:42] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/735/con" [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [11:57:43] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:58:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:12] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.2.4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/978037 (owner: 10Volans) [11:58:27] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:59:09] (03PS1) 10Jbond: Revert "Revert "readme: test puppet-merge"" [puppet] - 10https://gerrit.wikimedia.org/r/978019 [11:59:26] (03CR) 10Jbond: [C: 03+2] Revert "Revert "readme: test puppet-merge"" [puppet] - 10https://gerrit.wikimedia.org/r/978019 (owner: 
10Jbond) [11:59:52] 10SRE, 10Traffic, 10Patch-For-Review: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers - https://phabricator.wikimedia.org/T352160 (10Vgutierrez) [12:01:25] (03CR) 10Jon Harald Søby: [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [12:02:06] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:02:28] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:39] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:03:58] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:04] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.4 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/978037 (owner: 10Volans) [12:07:35] (03CR) 10WMDE-Fisch: Remove BetaFeature code related to ReferencePreviews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [12:08:21] (03PS1) 10Volans: Upstream release v1.2.4 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/978039 [12:08:32] * kart_ planning to update Apertium.. starting with staging.. 
[12:09:50] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:10] (03CR) 10KartikMistry: [C: 03+2] Update Apertium to 2023-11-23-055425-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977183 (https://phabricator.wikimedia.org/T346997) (owner: 10KartikMistry) [12:11:02] (03Merged) 10jenkins-bot: Update Apertium to 2023-11-23-055425-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977183 (https://phabricator.wikimedia.org/T346997) (owner: 10KartikMistry) [12:12:04] (03PS1) 10Jbond: Revert "Revert "Revert "readme: test puppet-merge""" [puppet] - 10https://gerrit.wikimedia.org/r/978020 [12:12:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Revert "Revert "readme: test puppet-merge""" [puppet] - 10https://gerrit.wikimedia.org/r/978020 (owner: 10Jbond) [12:12:59] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [12:13:24] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [12:23:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:26:11] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [12:26:49] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [12:27:43] (03PS3) 10Btullis: Update Presto TLS configuration in production [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [12:28:51] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/709737 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [12:31:16] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy 
(instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10LSobanski) [12:32:01] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply [12:32:28] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [12:35:13] !log Updated Apertium to 2023-11-23-055425-production (ie Bookworm!) (T346997) [12:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:21] T346997: Update Apertium service to Bookworm - https://phabricator.wikimedia.org/T346997 [12:38:13] (03PS2) 10Phuedx: Remove mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976930 (https://phabricator.wikimedia.org/T351195) [12:40:19] (03PS1) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [12:40:48] (03CR) 10CI reject: [V: 04-1] mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [12:42:06] (03PS2) 10Phuedx: Remove partial migration of VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337) (owner: 10Santiago Faci) [12:47:04] (03PS2) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [12:47:32] (03CR) 10CI reject: [V: 04-1] mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [12:49:06] (03PS3) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 
(https://phabricator.wikimedia.org/T327157) [12:51:39] (03PS4) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [12:51:56] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [12:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2028 T351916', diff saved to https://phabricator.wikimedia.org/P53923 and previous config saved to /var/cache/conftool/dbconfig/20231128-125235-root.json [12:52:49] T351916: Migrate es1 to Bookworm - https://phabricator.wikimedia.org/T351916 [12:52:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:53:12] (03PS1) 10Marostegui: Revert "dbproxy1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978021 [12:53:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:53:54] (03PS1) 10Marostegui: es2028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978046 (https://phabricator.wikimedia.org/T351916) [12:54:26] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978021 (owner: 10Marostegui) [12:54:36] (03CR) 10Marostegui: [C: 03+2] es2028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978046 (https://phabricator.wikimedia.org/T351916) (owner: 10Marostegui) [12:55:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 9.950 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:55:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.319 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm [12:56:27] (03PS5) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [12:56:41] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [12:57:49] (03CR) 10Volans: [C: 03+2] Upstream release v1.2.4 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/978039 (owner: 10Volans) [12:58:36] !log re-enable sampling on cr1-esams:fpc1 [12:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1300) [13:01:55] (03PS2) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:01:57] (03PS1) 10Jbond: merge_cli: drop old absent resource [puppet] - 10https://gerrit.wikimedia.org/r/978047 [13:02:36] (03CR) 10CI reject: [V: 04-1] puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:05:09] (03Merged) 10jenkins-bot: Upstream release v1.2.4 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/978039 (owner: 10Volans) [13:05:42] (03PS6) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [13:05:49] (03PS3) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:05:59] (03CR) 10Jbond: puppet-merge: add prometheus metrics (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:06:15] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [13:06:32] (03CR) 10CI reject: [V: 04-1] puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:12:51] (03PS4) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:12:53] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:53] (03PS1) 10Jbond: prometheus::node_exporter: allow users to update files they own [puppet] - 10https://gerrit.wikimedia.org/r/978049 [13:13:10] (03CR) 10Jbond: [C: 03+2] merge_cli: drop old absent resource [puppet] - 10https://gerrit.wikimedia.org/r/978047 (owner: 10Jbond) [13:13:30] (03CR) 10CI reject: [V: 04-1] puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:15:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [13:16:03] (03PS5) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:16:32] (03CR) 10CI reject: [V: 04-1] puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:16:45] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [13:16:59] Hello folks. 
I've got 3 analytics event stream removal patches scheduled for deployment in the next backport window. Should I squash them to reduce the number of deployments or leave 'em as they are? [13:17:35] (03PS7) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [13:17:50] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [13:18:03] !log uploaded python3-wmflib_1.2.4 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [13:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:11] (03PS3) 10Phuedx: Remove partial migration of VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337) (owner: 10Santiago Faci) [13:18:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [13:18:53] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:21:04] (03PS8) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [13:21:11] (03PS6) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:21:15] phuedx: as long as they merge cleanly (so in practice are stacked on top of each other) it's fine. 
a deployer can deploy multiple patches at once [13:21:22] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [13:23:53] taavi: Huh. Of course they can. I should have thought about that [13:24:57] (JobUnavailable) firing: (6) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:25:55] (03PS2) 10Jbond: prometheus::node_exporter: allow users to update files they own [puppet] - 10https://gerrit.wikimedia.org/r/978049 [13:25:57] (03PS7) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:27:26] (03PS9) 10Jcrespo: mediabackup: Update mediabackups to use new TLS CA and prepare for 0.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) [13:27:40] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [13:28:26] (03PS1) 10Muehlenhoff: ganeti/pki: Allow configuring an SSL chain file [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) [13:31:16] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10ayounsi) In my mind, trying to be too automatic or too smart here will only cause edge cases issues and complexity to troubleshot. And a too big project to implement. A... 
[13:32:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:33:00] (03PS8) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:33:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm [13:34:35] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:34:37] (03CR) 10Jcrespo: [C: 04-2] "This is ready for deploy, it just depends on the new package installation." [puppet] - 10https://gerrit.wikimedia.org/r/978042 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [13:35:11] (03CR) 10Muehlenhoff: "Looks good to me. Let's however add the content of the commit message to as a comment, so that it's more obvious to people using the defin" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [13:35:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53924 and previous config saved to /var/cache/conftool/dbconfig/20231128-133547-root.json [13:37:24] (03PS2) 10Muehlenhoff: ganeti/pki: Allow configuring an SSL chain file [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) [13:37:52] (03CR) 10CI reject: [V: 04-1] ganeti/pki: Allow configuring an SSL chain file [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:40:34] (03PS1) 10Marostegui: Revert "es2028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978022 [13:40:46] (03PS3) 10Muehlenhoff: ganeti/pki: Allow configuring an SSL chain file [puppet] - 10https://gerrit.wikimedia.org/r/978051 
(https://phabricator.wikimedia.org/T350686) [13:41:06] (03CR) 10Marostegui: [C: 03+2] Revert "es2028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978022 (owner: 10Marostegui) [13:42:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:44:22] (03PS2) 10Awight: Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978035 [13:48:38] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:46] (03PS9) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:50:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53925 and previous config saved to /var/cache/conftool/dbconfig/20231128-135052-root.json [13:51:48] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977240 (owner: 10PipelineBot) [13:52:36] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977240 (owner: 10PipelineBot) [13:53:41] (03PS3) 10Anzx: Enable VisualEditor in the Appendix namespace on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978023 (https://phabricator.wikimedia.org/T350926) [13:53:56] (03CR) 10Jbond: prometheus::node_exporter: allow users to update files they own (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [13:55:35] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:55:42] (03PS2) 10Phuedx: Remove 
EchoMail and EchoInteraction event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975394 (https://phabricator.wikimedia.org/T344167) [13:56:42] (03PS3) 10Jbond: prometheus::node_exporter: allow users to update files they own [puppet] - 10https://gerrit.wikimedia.org/r/978049 [13:56:44] (03PS10) 10Jbond: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) [13:59:29] (03PS4) 10Jbond: hiera - cloud: P:grafana now installes httpd so no need to do it seperatly [puppet] - 10https://gerrit.wikimedia.org/r/673038 [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1400). [14:00:05] phuedx, Dreamy_Jazz, and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:09] o/ [14:00:21] \o [14:00:50] o/ [14:02:10] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:02:40] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:02:47] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:03:39] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:04:22] (03CR) 10Muehlenhoff: ganeti/pki: Allow configuring an SSL chain file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:04:38] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:04:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [14:05:04] o/ [14:05:07] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:05:12] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:05:15] I can deploy I guess ^^ [14:05:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53926 and previous config saved to /var/cache/conftool/dbconfig/20231128-140557-root.json [14:05:59] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:06:34] (03PS3) 10Lucas Werkmeister (WMDE): Remove EchoMail and EchoInteraction event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975394 (https://phabricator.wikimedia.org/T344167) (owner: 10Phuedx) [14:06:42] !log lucaswerkmeister-wmde@deploy2002 Backport cancelled. [14:06:48] actually, hang on [14:06:51] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/976688 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [14:07:12] nah, let’s deploy them separately after all [14:07:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975394 (https://phabricator.wikimedia.org/T344167) (owner: 10Phuedx) [14:07:29] I was thinking about whether they should be combined but looking at the code I don’t feel confident enough for that [14:07:31] let’s just see how far we get [14:08:40] (03Merged) 10jenkins-bot: Remove EchoMail and EchoInteraction event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975394 (https://phabricator.wikimedia.org/T344167) (owner: 10Phuedx) [14:09:06] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:975394|Remove EchoMail and EchoInteraction event streams (T344167)]] [14:09:16] T344167: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 [14:10:30] !log lucaswerkmeister-wmde@deploy2002 phuedx and lucaswerkmeister-wmde: Backport for [[gerrit:975394|Remove EchoMail and EchoInteraction event streams (T344167)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:38] !log deploying python3-wmflib_1.2.4 fleet-wide (tested changes on all OSes) [14:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:43] phuedx: can you test this change? [14:11:00] Lucas_WMDE: I can. I can verify that the stream configs are coming through correctly on enwiki, for example [14:11:13] ok [14:12:13] please test, then :) [14:12:58] Lucas_WMDE: Tested. Confirmed that EventLogging is still working as expected (I've seen events being submitted) and that the stream configs have not been sent to the client [14:13:03] ack, thanks! 
[14:13:05] !log lucaswerkmeister-wmde@deploy2002 phuedx and lucaswerkmeister-wmde: Continuing with sync [14:13:48] (03PS19) 10Jbond: P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [14:15:00] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352124 (10Jhancock.wm) known issue with no impact. will be fixed for good when C row moved to spine/leaf [14:15:04] (03CR) 10Majavah: [C: 03+2] openstack: update wiki replica DNS to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/976688 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [14:15:10] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352124 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:17:32] (03CR) 10Jbond: P:netbase: parse the service catalogue and inject the service ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [14:18:50] (03CR) 10Jbond: [C: 03+1] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:19:12] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:975394|Remove EchoMail and EchoInteraction event streams (T344167)]] (duration: 10m 05s) [14:19:17] T344167: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 [14:19:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:06] (03PS3) 10Lucas Werkmeister (WMDE): Remove mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976930 (https://phabricator.wikimedia.org/T351195) (owner: 10Phuedx) [14:20:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/976930 (https://phabricator.wikimedia.org/T351195) (owner: 10Phuedx) [14:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53927 and previous config saved to /var/cache/conftool/dbconfig/20231128-142102-root.json [14:21:35] (03Merged) 10jenkins-bot: Remove mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976930 (https://phabricator.wikimedia.org/T351195) (owner: 10Phuedx) [14:21:59] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:976930|Remove mediawiki.web_ui.interactions event stream (T351195)]] [14:22:04] T351195: WikimediaEvents: Remove partial migration of *UIActions instrument - https://phabricator.wikimedia.org/T351195 [14:22:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:22:59] * volans checking [14:22:59] hello [14:23:06] cked [14:23:21] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:976930|Remove mediawiki.web_ui.interactions event stream (T351195)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:33] thanks [14:23:34] * Lucas_WMDE holding for p.age [14:23:54] phuedx: please test on mwdebug in the meantime [14:26:26] Lucas_WMDE: Tested. 
Confirmed that EventLogging is still working as expected (I've seen events being submitted) and that the stream configs have not been sent to the client on testwiki (group0) [14:26:33] ok thanks [14:27:06] (not deploying yet, due to that HAProxy alert) [14:27:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:29:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:29:52] volans: okay for me to continue deploying? [14:30:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:09] (03CR) 10Muehlenhoff: [C: 03+2] ganeti/pki: Allow configuring an SSL chain file [puppet] - 10https://gerrit.wikimedia.org/r/978051 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [14:31:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:31:54] Lucas_WMDE: I think it's ok to continue, Amir1 what do you think? [14:32:02] sure [14:32:09] ok thanks [14:32:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Continuing with sync [14:34:33] (03PS1) 10Slyngshede: Keymanagement: SSH keys are in some cases not synced to LDAP. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) [14:36:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53928 and previous config saved to /var/cache/conftool/dbconfig/20231128-143608-root.json [14:38:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [14:38:19] (03PS4) 10Phuedx: Remove partial migration of VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337) (owner: 10Santiago Faci) [14:38:59] (JobUnavailable) firing: (6) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:28] (03CR) 10Phuedx: "PS4 is a manual rebase." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337) (owner: 10Santiago Faci) [14:39:35] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:976930|Remove mediawiki.web_ui.interactions event stream (T351195)]] (duration: 17m 36s) [14:39:40] T351195: WikimediaEvents: Remove partial migration of *UIActions instrument - https://phabricator.wikimedia.org/T351195 [14:39:53] (03PS6) 10Jbond: ssl: new ssl module planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) [14:40:24] (03CR) 10CI reject: [V: 04-1] ssl: new ssl module planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [14:40:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337) (owner: 10Santiago Faci) [14:41:28] (03Merged) 10jenkins-bot: Remove partial migration of VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977986 (https://phabricator.wikimedia.org/T351337) (owner: 10Santiago Faci) [14:41:49] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:977986|Remove partial migration of VisualEditorFeatureUse instrument (T351337)]] [14:41:55] T351337: Remove partial migration of VisualEditorFeatureUse instrument - https://phabricator.wikimedia.org/T351337 [14:42:03] (03PS7) 10Jbond: ssl: new ssl module planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) [14:42:33] (03CR) 10CI reject: [V: 04-1] ssl: new ssl module planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [14:43:10] !log 
lucaswerkmeister-wmde@deploy2002 sfaci and lucaswerkmeister-wmde: Backport for [[gerrit:977986|Remove partial migration of VisualEditorFeatureUse instrument (T351337)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:43:20] phuedx: please test [14:45:52] Lucas_WMDE: Tested as before. It LGTM :) [14:46:00] ok :) [14:46:01] !log lucaswerkmeister-wmde@deploy2002 sfaci and lucaswerkmeister-wmde: Continuing with sync [14:47:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10CDanis) >>! In T349244#9360471, @Fabfur wrote: > Looping in @CDanis as the original author for the [[ https://gerrit.wikimedia.org/r/c/operations/p... [14:47:36] (03PS1) 10Jbond: puppetserver::rsync_module: explicitly include puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/978058 [14:48:33] h,, O [14:48:38] oops [14:49:19] We get 503s on various wikis [14:49:27] phuedx: your change would also be safe to revert, right? 
[14:49:28] +1 [14:49:33] I don’t expect it to be the cause of those 503s, but still [14:49:39] might as well revert it imho [14:49:49] Lucas_WMDE: Safe to revert [14:50:01] ok, let’s see if I end up doing it or not [14:50:17] (scap still running, currently in php-fpm-restart) [14:50:17] I can open the wikis, got 503 on superset [14:50:23] wikis also back for me [14:50:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53929 and previous config saved to /var/cache/conftool/dbconfig/20231128-145113-root.json [14:51:50] Will my config change be got to in this window? [14:51:57] maybe [14:51:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/736/console" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (owner: 10Jbond) [14:52:02] aanzx almost certainly not, sorry [14:52:06] but Dreamy_Jazz maybe [14:52:07] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:977986|Remove partial migration of VisualEditorFeatureUse instrument (T351337)]] (duration: 10m 17s) [14:52:13] T351337: Remove partial migration of VisualEditorFeatureUse instrument - https://phabricator.wikimedia.org/T351337 [14:52:23] jouncebot: next [14:52:23] In 1 hour(s) and 7 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1600) [14:52:46] (03CR) 10Btullis: [C: 03+2] Upgrade airflow on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/978028
(https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [14:52:50] My config change should have no on-wiki change as it's needed to pin the value before the config is defined and used in a patch that depends on it. [14:53:22] Lucas_WMDE: ok, will reschedule it [14:54:00] (JobUnavailable) firing: (6) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:39] (03CR) 10Jforrester: [C: 03+1] "ISTR there was some alerting pinging this on testwiki or something? It's very hazy though. Let's see what happens." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) (owner: 10Ladsgroup) [14:54:41] (03CR) 10Btullis: "I believe that this is ready to go, but we are waiting for a new version of wmfdata-python to be released and deployed:" [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:55:05] (03PS1) 10Klausman: ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 [14:55:08] Received 503 for wikitech [14:55:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:56:13] (03PS2) 10Jbond: puppetserver::rsync_module: explicitly include puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/978058 [14:57:19] (03PS2) 10Klausman: ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 [14:57:33] 10SRE: 503 Service Unavailable (all Wikimedia sites) -
https://phabricator.wikimedia.org/T352182 (10Aklapper) p:05Triage→03Unbreak! [14:57:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:57:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/737/console" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (owner: 10Jbond) [14:57:59] ackd [14:58:21] !log UTC afternoon backport+config window done [14:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:30] (closing a bit early due to the ongoing issues) [14:59:05] (03PS1) 10Ladsgroup: Revert "haproxy: re-set varnish maxconn on all cp hosts" [puppet] - 10https://gerrit.wikimedia.org/r/978025 [14:59:38] (03PS3) 10Klausman: ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 (https://phabricator.wikimedia.org/T343123) [15:00:24] (03PS4) 10Klausman: ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 (https://phabricator.wikimedia.org/T343123) [15:02:05] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS bullseye [15:02:05] (03PS1) 10Cathal Mooney: Set tty serial to com2 for Dell Poweredge R750 model variants [cookbooks] - 10https://gerrit.wikimedia.org/r/978061 (https://phabricator.wikimedia.org/T349936) [15:02:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:02:58] (03PS2) 10Ladsgroup: Revert "haproxy: re-set varnish maxconn on all cp hosts" [puppet] - 10https://gerrit.wikimedia.org/r/978025 [15:03:15] (03CR) 10Vgutierrez: [C: 03+1] Revert "haproxy: re-set varnish maxconn on all cp hosts" [puppet] - 10https://gerrit.wikimedia.org/r/978025 (owner: 10Ladsgroup) [15:03:26] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "haproxy: re-set varnish maxconn on all cp hosts" [puppet] - 10https://gerrit.wikimedia.org/r/978025 (owner: 10Ladsgroup) [15:04:04] (03PS3) 10Jbond: puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 [15:04:19] (03PS4) 10Jbond: puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 [15:05:29] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:05:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (owner: 10Jbond) [15:06:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53930 and previous config saved to /var/cache/conftool/dbconfig/20231128-150618-root.json [15:06:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:00] (03CR) 10CI reject: [V: 04-1] puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 (owner: 10Jbond) [15:07:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/739/console" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (owner: 10Jbond) [15:08:23] (03CR) 10Jbond: puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 (owner: 
10Jbond) [15:08:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2029.codfw.wmnet with OS bullseye [15:08:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye [15:10:43] (03PS1) 10Giuseppe Lavagetto: haproxy: lower maximum concurrent requests to 500 [puppet] - 10https://gerrit.wikimedia.org/r/978062 [15:11:28] (03CR) 10Ladsgroup: [C: 03+1] haproxy: lower maximum concurrent requests to 500 [puppet] - 10https://gerrit.wikimedia.org/r/978062 (owner: 10Giuseppe Lavagetto) [15:12:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] haproxy: lower maximum concurrent requests to 500 [puppet] - 10https://gerrit.wikimedia.org/r/978062 (owner: 10Giuseppe Lavagetto) [15:14:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED [15:15:10] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2011.codfw.wmnet with OS bullseye [15:15:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS bullseye [15:24:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [15:24:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:12] !log imported ganeti 3.0.2-1~deb11u1+wmf1 to apt.wikimedia.org/bullseye-wikimedia T350686 [15:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:17] T350686: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 [15:25:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2096.mgmt.codfw.wmnet with reboot policy 
FORCED [15:25:47] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:50] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [15:28:51] (03PS5) 10Jbond: puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) [15:29:25] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2011.codfw.wmnet with reason: host reimage [15:30:07] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2096'] [15:30:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2096'] [15:30:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) (owner: 10Jbond) [15:30:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2096'] [15:31:32] (03CR) 10CI reject: [V: 04-1] puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) (owner: 10Jbond) [15:32:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) elastic2096 got a space in the serial number somehow. it has been fixed and the provisioning script took. upgrading firmware. 
[15:32:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) [15:32:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2011.codfw.wmnet with reason: host reimage [15:33:07] (03PS8) 10Jbond: ssl: new ssl module planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) [15:33:41] (03PS1) 10Herron: wip [alerts] - 10https://gerrit.wikimedia.org/r/978063 [15:33:48] (03CR) 10Jbond: "Let me know if this is something you would like. if so i can add some tests and follow up patches to cconvert" [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:35:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/978061 (https://phabricator.wikimedia.org/T349936) (owner: 10Cathal Mooney) [15:37:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2004.codfw.wmnet [15:38:55] (03PS2) 10Herron: arclamp: redirect alerts to o11y [alerts] - 10https://gerrit.wikimedia.org/r/978063 (https://phabricator.wikimedia.org/T349159) [15:41:22] 10SRE, 10Data-Platform-SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (10brouberol) 05Open→03Resolved [15:41:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2004.codfw.wmnet [15:41:38] (03PS6) 10Jbond: puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) [15:42:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2096'] [15:42:16] 10SRE, 10Data-Platform-SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (10MoritzMuehlenhoff) Great 
work, really useful and well done! [15:43:41] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/978063 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [15:43:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [15:43:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [15:44:39] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) (owner: 10Jbond) [15:45:23] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/977738 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:46:49] (03CR) 10Herron: [C: 03+1] "Thanks, worth a try for sure!" [puppet] - 10https://gerrit.wikimedia.org/r/977738 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:48:23] (03CR) 10JHathaway: [C: 03+2] puppetserver::rsync_module: pass the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) (owner: 10Jbond) [15:49:30] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:49:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2029.codfw.wmnet with OS bullseye [15:49:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye executed with errors: - restbase20... 
[15:51:58] (03CR) 10Cathal Mooney: [C: 03+2] Set tty serial to com2 for Dell Poweredge R750 model variants [cookbooks] - 10https://gerrit.wikimedia.org/r/978061 (https://phabricator.wikimedia.org/T349936) (owner: 10Cathal Mooney) [15:52:42] 10SRE: 503 Service Unavailable (all Wikimedia sites) - https://phabricator.wikimedia.org/T352182 (10eoghan) Thanks for the report! We're aware of this and currently tracking its status [[ https://www.wikimediastatus.net/incidents/m58dg3ljx3sk | here ]]. We'll update the status page as we know more. [15:54:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2011.codfw.wmnet with OS bullseye [15:54:25] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:59] !log installing xen security updates [15:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:48] we have Xen somewhere? ;) [15:56:14] (03Merged) 10jenkins-bot: Set tty serial to com2 for Dell Poweredge R750 model variants [cookbooks] - 10https://gerrit.wikimedia.org/r/978061 (https://phabricator.wikimedia.org/T349936) (owner: 10Cathal Mooney) [15:58:12] 10SRE, 10Wikimedia-Incident: 503 Service Unavailable (all Wikimedia sites) - https://phabricator.wikimedia.org/T352182 (10Aklapper) [16:00:05] eoghan, jelto, and arnoldokoth: gettimeofday() says it's time for SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1600) [16:00:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/741/console" [puppet] - 10https://gerrit.wikimedia.org/r/978058 (https://phabricator.wikimedia.org/T352156) (owner: 10Jbond) [16:01:25] RECOVERY - Check systemd state on puppetserver1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:52] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2011.codfw.wmnet [16:01:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2011.codfw.wmnet [16:02:19] 10SRE, 10Wikimedia-Incident: 503 Service Unavailable (all Wikimedia sites) - https://phabricator.wikimedia.org/T352182 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup This has been resolved as seen in the status page. Thank you for reporting! 
[16:02:35] (03CR) 10Elukey: [C: 03+1] ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [16:02:51] (03PS3) 10Vgutierrez: hiera: Disable rp filter on ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) [16:02:53] (03PS1) 10Vgutierrez: base::sysctl: Allow disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/978088 (https://phabricator.wikimedia.org/T352160) [16:04:25] RECOVERY - Check systemd state on puppetserver2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:31] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2012.codfw.wmnet with OS bullseye [16:05:17] 10SRE, 10Wikimedia-Incident: 503 Service Unavailable (all Wikimedia sites) - https://phabricator.wikimedia.org/T352182 (10CDanis) Users are no longer impacted. Unfortunately due to the nature of the outage (a DDoS) we can't publish more details. 
[16:05:28] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [16:07:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10VRiley-WMF) [16:07:40] !log installing distro-info-data updates [16:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:49] RECOVERY - Check systemd state on puppetserver2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:30] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [16:17:24] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [16:17:38] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2012.codfw.wmnet with OS bullseye [16:17:50] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2012.codfw.wmnet with OS bullseye [16:20:04] (03Merged) 10jenkins-bot: ml-serve/istio: Add Restbase as a handled destination for requests from LW [deployment-charts] - 10https://gerrit.wikimedia.org/r/978059 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [16:22:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 
(10VRiley-WMF) [16:23:02] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:23:20] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:24:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:26:43] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/742/console" [puppet] - 10https://gerrit.wikimedia.org/r/978088 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [16:28:25] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/743/con" [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [16:29:34] 10SRE, 10ops-eqiad: decommission flerovium - https://phabricator.wikimedia.org/T352193 (10MoritzMuehlenhoff) [16:30:24] (03PS1) 10Sbisson: Configure wiki-highlights experiment stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978096 (https://phabricator.wikimedia.org/T348613) [16:34:43] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2012.codfw.wmnet with OS bullseye [16:35:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2012.codfw.wmnet with OS bullseye [16:35:46] (03CR) 10Bearloga: [C: 03+1] Configure wiki-highlights experiment stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978096 (https://phabricator.wikimedia.org/T348613) (owner: 10Sbisson) [16:37:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/978088 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [16:39:10] (03PS1) 10Klausman: 
ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) [16:39:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2029.codfw.wmnet with OS bullseye [16:40:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye [16:44:20] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2029.codfw.wmnet with OS bullseye [16:44:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye executed with errors: - restbase20... 
[16:44:32] (03PS5) 10Jbond: (do not merge) Testing get useres function [puppet] - 10https://gerrit.wikimedia.org/r/690367 [16:46:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2029.codfw.wmnet with OS bullseye [16:46:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye [16:49:23] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2012.codfw.wmnet with reason: host reimage [16:49:52] (03PS5) 10Jbond: admin::get_users: add function to get a list of configured users [puppet] - 10https://gerrit.wikimedia.org/r/690366 [16:49:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2030.codfw.wmnet with OS bullseye [16:49:54] (03PS6) 10Jbond: (do not merge) Testing get users function [puppet] - 10https://gerrit.wikimedia.org/r/690367 [16:50:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2030.codfw.wmnet with OS bullseye [16:50:26] (03CR) 10CI reject: [V: 04-1] (do not merge) Testing get users function [puppet] - 10https://gerrit.wikimedia.org/r/690367 (owner: 10Jbond) [16:51:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/744/con" [puppet] - 10https://gerrit.wikimedia.org/r/690367 (owner: 10Jbond) [16:52:15] (03PS6) 10Jbond: admin::get_users: add function to get a list of configured users [puppet] - 10https://gerrit.wikimedia.org/r/690366 [16:52:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2012.codfw.wmnet with 
reason: host reimage [16:53:12] (03CR) 10Jbond: "This seems finished let me know what you think" [puppet] - 10https://gerrit.wikimedia.org/r/690366 (owner: 10Jbond) [16:53:42] (03Abandoned) 10Jbond: CP1079: revert combined CA [puppet] - 10https://gerrit.wikimedia.org/r/689141 (owner: 10Jbond) [16:54:02] (03CR) 10CI reject: [V: 04-1] admin::get_users: add function to get a list of configured users [puppet] - 10https://gerrit.wikimedia.org/r/690366 (owner: 10Jbond) [16:54:28] (03Abandoned) 10Jbond: exim: make exim class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/688333 (https://phabricator.wikimedia.org/T232343) (owner: 10Jbond) [16:54:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10eoghan) 05Open→03Resolved a:03eoghan The point of contact/dates have been updated and confirmed that access is working as required! [16:57:12] (03CR) 10Elukey: [C: 03+1] ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [16:58:12] (03CR) 10CI reject: [V: 04-1] admin::get_users: add function to get a list of configured users [puppet] - 10https://gerrit.wikimedia.org/r/690366 (owner: 10Jbond) [16:59:08] (03PS2) 10Jbond: P:tcpircbot: delete a previoulsy absented resource [puppet] - 10https://gerrit.wikimedia.org/r/673229 [17:00:03] (03CR) 10Jbond: [C: 03+2] P:tcpircbot: delete a previoulsy absented resource [puppet] - 10https://gerrit.wikimedia.org/r/673229 (owner: 10Jbond) [17:00:05] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1700). nyaa~ [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:02:09] (03PS7) 10Jbond: admin::get_users: add function to get a list of configured users [puppet] - 10https://gerrit.wikimedia.org/r/690366 [17:03:56] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:04:07] (03Abandoned) 10Jbond: role and profile specs: add example spec test [puppet] - 10https://gerrit.wikimedia.org/r/642423 (owner: 10Jbond) [17:04:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:04:37] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [17:04:55] (03Abandoned) 10Jbond: P:analytics::cluster::packages::common: demostrate failing spec test [puppet] - 10https://gerrit.wikimedia.org/r/644786 (https://phabricator.wikimedia.org/T261693) (owner: 10Jbond) [17:04:57] (03CR) 10Jbond: [C: 03+2] P:analytics::cluster::packages::common: Add simple spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) (owner: 10Jbond) [17:09:07] (03PS6) 10Jbond: profile::maps::tlsproxy: update profile to use envoy for tls termination [puppet] - 10https://gerrit.wikimedia.org/r/585248 [17:10:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/745/con" [puppet] - 10https://gerrit.wikimedia.org/r/585248 (owner: 10Jbond) [17:10:42] (03CR) 10Jbond: "@hnowlan please let me know if this is something you want or if i should abbandon" [puppet] - 10https://gerrit.wikimedia.org/r/585248 (owner: 10Jbond) [17:11:36] (03PS1) 10Papaul: Add restbase2029 to apt_repo.yaml [puppet] - 10https://gerrit.wikimedia.org/r/978100 (https://phabricator.wikimedia.org/T349758) [17:12:14] (03CR) 10Papaul: [C: 03+2] Add restbase2029 to apt_repo.yaml [puppet] - 10https://gerrit.wikimedia.org/r/978100 
(https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [17:12:32] (03PS2) 10Jbond: Revert "P:pki::get_cert: add calling module to fail message" [puppet] - 10https://gerrit.wikimedia.org/r/694307 [17:13:47] (03Abandoned) 10Jbond: Revert "P:pki::get_cert: add calling module to fail message" [puppet] - 10https://gerrit.wikimedia.org/r/694307 (owner: 10Jbond) [17:14:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2012.codfw.wmnet with OS bullseye [17:14:56] (03Abandoned) 10Jbond: O:envoyproxy: add a way to restart envoy proxy [puppet] - 10https://gerrit.wikimedia.org/r/694379 (owner: 10Jbond) [17:17:26] (03CR) 10Hnowlan: [C: 03+1] "Looks great, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/585248 (owner: 10Jbond) [17:23:35] (03Abandoned) 10Jbond: role::puppetmaster::standalone: add type checking to autosign [puppet] - 10https://gerrit.wikimedia.org/r/566512 (https://phabricator.wikimedia.org/T284082) (owner: 10Jbond) [17:27:46] (03Abandoned) 10Jbond: Revert "P:cumin::master: Use sudo to run the check command" [puppet] - 10https://gerrit.wikimedia.org/r/699169 (owner: 10Jbond) [17:28:01] (03Abandoned) 10Jbond: (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - 10https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221) (owner: 10Jbond) [17:32:12] (03PS1) 10Papaul: Testng why insller partman isn't working on those nodes [puppet] - 10https://gerrit.wikimedia.org/r/978101 (https://phabricator.wikimedia.org/T349758) [17:33:26] (03PS2) 10Papaul: Testng why partman isn't working on those nodes [puppet] - 10https://gerrit.wikimedia.org/r/978101 (https://phabricator.wikimedia.org/T349758) [17:33:35] (03Abandoned) 10Jbond: C:locales: Add ability to customise the installed locales [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) [17:35:25] (03CR) 10Jbond: [V: 03+1] "its shame this one got dropped but it would 
be a really good one to pick up again" [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [17:35:40] (03PS2) 10Jbond: cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) [17:35:42] (03PS2) 10Jbond: cloud - hiera: add wmflib::expand_path to hiera [puppet] - 10https://gerrit.wikimedia.org/r/702326 (https://phabricator.wikimedia.org/T285539) [17:35:52] (03CR) 10Papaul: [C: 03+2] Testng why partman isn't working on those nodes [puppet] - 10https://gerrit.wikimedia.org/r/978101 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [17:37:04] (03CR) 10Ssingh: [C: 03+1] base::sysctl: Allow disabling rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/978088 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [17:38:04] (03Abandoned) 10Jbond: utils/run_ci_localy.sh: run CI locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [17:38:29] (03PS3) 10Jbond: P:envoy::builder: disable timer logging [puppet] - 10https://gerrit.wikimedia.org/r/716222 [17:39:41] (03CR) 10Ssingh: [C: 03+1] hiera: Disable rp filter on ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/978038 (https://phabricator.wikimedia.org/T352160) (owner: 10Vgutierrez) [17:43:47] (03CR) 10Ssingh: [C: 03+1] "LGTM! I think we should a separate bug for the decomm though; something like https://phabricator.wikimedia.org/T344363. 
This makes it easi" [puppet] - 10https://gerrit.wikimedia.org/r/977702 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [17:44:35] (03CR) 10Jbond: [C: 03+2] P:envoy::builder: disable timer logging [puppet] - 10https://gerrit.wikimedia.org/r/716222 (owner: 10Jbond) [17:45:54] (03Abandoned) 10Jbond: openstack: quick PoC porting wmcs-enc-cli to a spicerack module [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 (owner: 10Jbond) [17:47:27] (03PS2) 10Klausman: ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) [17:49:06] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [17:49:09] (03PS2) 10Jbond: lldp: fix confine [puppet] - 10https://gerrit.wikimedia.org/r/716440 [17:50:36] (03Abandoned) 10Jbond: lldp: fix confine [puppet] - 10https://gerrit.wikimedia.org/r/716440 (owner: 10Jbond) [17:51:08] (03PS2) 10Jbond: P:java: add spec tests for java profile [puppet] - 10https://gerrit.wikimedia.org/r/719109 [17:51:27] (03PS3) 10Klausman: ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) [17:52:34] (03CR) 10Elukey: [C: 03+1] ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [17:52:51] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [17:53:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:53:40] (03CR) 
10CI reject: [V: 04-1] P:java: add spec tests for java profile [puppet] - 10https://gerrit.wikimedia.org/r/719109 (owner: 10Jbond) [17:55:13] (03CR) 10Jbond: [C: 04-1] "@jesse i think this will still suffer from the same issues we had with https://github.com/voxpupuli/puppet-prometheus_reporter and if we " [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [17:55:35] (03CR) 10CDanis: [C: 03+1] "This looks correct to me." [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:56:13] (03Merged) 10jenkins-bot: ml-serve/istio: fix wrong port in destination rule for restgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/978098 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [17:56:37] (03PS3) 10Jbond: P:java: add spec tests for java profile [puppet] - 10https://gerrit.wikimedia.org/r/719109 [17:56:53] (03PS3) 10Jbond: apt::pin: include apt [puppet] - 10https://gerrit.wikimedia.org/r/723520 [18:00:07] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1800) [18:04:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [18:05:29] (03PS6) 10Jbond: (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) [18:05:31] (03PS5) 10Jbond: systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) [18:06:02] (03PS7) 10Jbond: interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) [18:06:04] (03PS6) 10Jbond: systemd: Add 
support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) [18:06:52] (03CR) 10CI reject: [V: 04-1] interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [18:06:57] !log milimetric@deploy2002 Started deploy [analytics/refinery@72ec207]: hotfix for webrequest refine [18:07:33] (03CR) 10Jbond: "this still seems useful and worth pursuing to me but would need input from traffic" [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [18:08:35] (03CR) 10CI reject: [V: 04-1] systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [18:08:50] (03CR) 10Jbond: [C: 03+2] P:java: add spec tests for java profile [puppet] - 10https://gerrit.wikimedia.org/r/719109 (owner: 10Jbond) [18:09:53] (03Abandoned) 10Jbond: apt::pin: include apt [puppet] - 10https://gerrit.wikimedia.org/r/723520 (owner: 10Jbond) [18:10:29] (03Abandoned) 10Jbond: "P:base: drop broad dependency" [puppet] - 10https://gerrit.wikimedia.org/r/725276 (owner: 10Jbond) [18:10:47] (03PS2) 10Jbond: r_lang: fix use of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726882 [18:10:56] (03Abandoned) 10Jbond: r_lang: fix use of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726882 (owner: 10Jbond) [18:11:04] (03PS1) 10Papaul: Update partman for new restbase node [puppet] - 10https://gerrit.wikimedia.org/r/978115 (https://phabricator.wikimedia.org/T349758) [18:11:35] (03CR) 10CI reject: [V: 04-1] Update partman for new restbase node [puppet] - 10https://gerrit.wikimedia.org/r/978115 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [18:12:26] (03CR) 10Jbond: "@jesse not sure what you think about this. 
either way it would need a refresh" [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [18:12:58] (03Abandoned) 10Jbond: changelog: fix distro [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732941 (owner: 10Jbond) [18:13:23] (03Abandoned) 10Jbond: C:scap::scripts: ensure we inlucde scap::master [puppet] - 10https://gerrit.wikimedia.org/r/734996 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [18:13:49] (03PS2) 10Jbond: P:httpbb: check for basicauth_credentials using defined [puppet] - 10https://gerrit.wikimedia.org/r/734999 (https://phabricator.wikimedia.org/T294435) [18:15:05] (03PS2) 10Papaul: Update partman for new restbase node [puppet] - 10https://gerrit.wikimedia.org/r/978115 (https://phabricator.wikimedia.org/T349758) [18:15:13] (03PS1) 10Bking: miscweb: fix typo in wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/978118 (https://phabricator.wikimedia.org/T347355) [18:15:28] (03PS1) 10Ssingh: tests: add schema for dnsbox [software/conftool] - 10https://gerrit.wikimedia.org/r/978119 [18:15:30] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/734999 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [18:15:45] !log milimetric@deploy2002 Finished deploy [analytics/refinery@72ec207]: hotfix for webrequest refine (duration: 08m 47s) [18:16:15] !log milimetric@deploy2002 Started deploy [analytics/refinery@72ec207] (thin): hotfix for webrequest refine [18:16:23] !log milimetric@deploy2002 Finished deploy [analytics/refinery@72ec207] (thin): hotfix for webrequest refine (duration: 00m 07s) [18:17:19] (03CR) 10Papaul: [C: 03+2] Update partman for new restbase node [puppet] - 10https://gerrit.wikimedia.org/r/978115 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [18:18:19] (03Abandoned) 10Jbond: yamllint: test yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 
(https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [18:18:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978118 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:19:23] (03Abandoned) 10Jbond: (test) migrate sretest to new role_data profile [puppet] - 10https://gerrit.wikimedia.org/r/692636 (owner: 10Jbond) [18:21:24] (03PS2) 10Jbond: (WIP) Implement json logging [puppet] - 10https://gerrit.wikimedia.org/r/694493 [18:23:20] (03CR) 10Jbond: "@jesse im guessing the plan here was to make it easier to parse things in OpenSearch (logstash). I don't remember much more context let m" [puppet] - 10https://gerrit.wikimedia.org/r/694493 (owner: 10Jbond) [18:24:06] (03CR) 10Dzahn: [C: 03+1] miscweb: fix typo in wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/978118 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:24:31] (03CR) 10Bking: [C: 03+2] miscweb: fix typo in wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/978118 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:24:35] (03CR) 10Dzahn: [C: 03+1] "Douglas Adams :) https://www.wikidata.org/wiki/Q42" [puppet] - 10https://gerrit.wikimedia.org/r/978118 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:24:56] (03CR) 10CDanis: [C: 03+2] tests: add schema for dnsbox [software/conftool] - 10https://gerrit.wikimedia.org/r/978119 (owner: 10Ssingh) [18:25:02] (03CR) 10Andrew Bogott: "This seems great, although I await the opinion of the pcc." 
[puppet] - 10https://gerrit.wikimedia.org/r/702326 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [18:26:30] (03PS3) 10Jbond: P:tlsproxy::instance: Drop numa_networking global [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) [18:27:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Papaul) new restbase nodes is using different partman recipe then the one in apt_repo.yaml file so adding the right partman recipe for the new node... [18:27:55] (03PS2) 10Jbond: R:icingamonitor::elasticsearch::cirrus_settings_check [puppet] - 10https://gerrit.wikimedia.org/r/735015 (https://phabricator.wikimedia.org/T294435) [18:28:03] (03Merged) 10jenkins-bot: tests: add schema for dnsbox [software/conftool] - 10https://gerrit.wikimedia.org/r/978119 (owner: 10Ssingh) [18:28:32] (03PS2) 10Jbond: R:mtail::program: WIP - i dont think we need to manage metaparamters [puppet] - 10https://gerrit.wikimedia.org/r/735021 (https://phabricator.wikimedia.org/T294435) [18:29:14] (03CR) 10EoghanGaffney: [C: 03+1] aphlict: move 'puppet-controlled config' to role-level [puppet] - 10https://gerrit.wikimedia.org/r/974668 (owner: 10Dzahn) [18:29:28] (03PS3) 10Jbond: R:mtail::program: dont manage metaparamters [puppet] - 10https://gerrit.wikimedia.org/r/735021 (https://phabricator.wikimedia.org/T294435) [18:30:34] (03PS2) 10Jbond: R:varnish::wikimedia_vcl: drop undefined metaparameters [puppet] - 10https://gerrit.wikimedia.org/r/735005 (https://phabricator.wikimedia.org/T294435) [18:30:57] (03PS3) 10Jbond: R:varnish::wikimedia_vcl: drop undefined metaparameters [puppet] - 10https://gerrit.wikimedia.org/r/735005 (https://phabricator.wikimedia.org/T294435) [18:31:23] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/735005 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [18:33:15] (03CR) 10Jbond: 
"@jesse no idea what this was for, inclination is to abandon but let me know if you think this is useful" [puppet] - 10https://gerrit.wikimedia.org/r/736212 (owner: 10Jbond) [18:33:37] (03Abandoned) 10Jbond: r10k: create r10k production environment [puppet] - 10https://gerrit.wikimedia.org/r/736425 (owner: 10Jbond) [18:34:09] (03Abandoned) 10Jbond: (WIP) to test CI [puppet] - 10https://gerrit.wikimedia.org/r/736505 (owner: 10Jbond) [18:34:21] (03Abandoned) 10Jbond: DO NOT MEREG - broken commit for testing pcc changes [puppet] - 10https://gerrit.wikimedia.org/r/736729 (owner: 10Jbond) [18:34:23] (03PS1) 10Papaul: adding new restbase to the file is breaking puppet on the apt server [puppet] - 10https://gerrit.wikimedia.org/r/978122 (https://phabricator.wikimedia.org/T349758) [18:34:39] (03CR) 10CI reject: [V: 04-1] adding new restbase to the file is breaking puppet on the apt server [puppet] - 10https://gerrit.wikimedia.org/r/978122 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [18:38:52] (03PS2) 10Jbond: P:trafficserver::backend: use ca provided by P:base::certificates [puppet] - 10https://gerrit.wikimedia.org/r/737408 (https://phabricator.wikimedia.org/T291905) [18:39:19] (03CR) 10CI reject: [V: 04-1] P:trafficserver::backend: use ca provided by P:base::certificates [puppet] - 10https://gerrit.wikimedia.org/r/737408 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [18:39:43] (03Abandoned) 10Jbond: P:trafficserver::backend: use ca provided by P:base::certificates [puppet] - 10https://gerrit.wikimedia.org/r/737408 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [18:40:50] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: add a clouds.yaml file for environment setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:41:34] (03CR) 10Andrew Bogott: [C: 03+2] openstack::cinder::user: Switch to systemd::sysuser [puppet] - 
10https://gerrit.wikimedia.org/r/881885 (owner: 10Muehlenhoff) [18:43:34] (03CR) 10Andrew Bogott: [C: 03+2] ldap: client: auto-restart sssd-nss on failure [puppet] - 10https://gerrit.wikimedia.org/r/970728 (https://phabricator.wikimedia.org/T349687) (owner: 10Majavah) [18:44:23] (03PS2) 10Jbond: adding new restbase to the file is breaking puppet on the apt server [puppet] - 10https://gerrit.wikimedia.org/r/978122 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [18:47:49] (03CR) 10Papaul: [C: 03+2] adding new restbase to the file is breaking puppet on the apt server [puppet] - 10https://gerrit.wikimedia.org/r/978122 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [18:50:16] (03Abandoned) 10Jbond: fix netbase [puppet] - 10https://gerrit.wikimedia.org/r/737736 (owner: 10Jbond) [18:51:29] (03Abandoned) 10Jbond: otrs_aliases: sort and unique emails [puppet] - 10https://gerrit.wikimedia.org/r/739825 (owner: 10Jbond) [18:51:46] (03Abandoned) 10Jbond: WIP: test [puppet] - 10https://gerrit.wikimedia.org/r/740139 (owner: 10Jbond) [18:52:12] code coverage bot is dying a lot lately [18:53:41] (03Abandoned) 10Jbond: puppetboard - service: update puppetboard live check [puppet] - 10https://gerrit.wikimedia.org/r/741146 (owner: 10Jbond) [18:54:57] (JobUnavailable) firing: (5) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:56:56] (03Abandoned) 10Jbond: O:puppet_compiler::puppetdb: Add role for puppetdb compiler (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/741850 (owner: 10Jbond) [18:59:04] (03CR) 10Jbond: "im going to abandon this change as its incomplete but wanted to tag you both as a pointer to what i was thinking" [puppet] - 10https://gerrit.wikimedia.org/r/745992 (owner: 10Jbond) [18:59:12] (03Abandoned) 10Jbond: WIP: move towards asyncio 
[puppet] - 10https://gerrit.wikimedia.org/r/745992 (owner: 10Jbond) [19:00:08] hashar and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1900). [19:00:33] (03Abandoned) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194 (owner: 10Jbond) [19:00:53] (03Abandoned) 10Jbond: Revert "mirrors.wikimedia.org: Add new mirror server to dmz_cidr" [puppet] - 10https://gerrit.wikimedia.org/r/748284 (owner: 10Jbond) [19:04:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/881885 (owner: 10Muehlenhoff) [19:07:02] (03PS2) 10Jbond: C:statistics::compute: correct user param [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) [19:07:25] (03PS1) 10Cathal Mooney: Reverse logic to select correct virtual console serial mode [cookbooks] - 10https://gerrit.wikimedia.org/r/978127 [19:07:30] (03CR) 10CI reject: [V: 04-1] C:statistics::compute: correct user param [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [19:08:27] (03PS3) 10Jbond: C:statistics::compute: correct user param [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) [19:08:43] (03PS2) 10Cathal Mooney: Reverse logic to select correct virtual console serial mode [cookbooks] - 10https://gerrit.wikimedia.org/r/978127 [19:09:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 
10Jbond) [19:10:01] (03Abandoned) 10Jbond: Revert "bgpalerter: update hiera" [puppet] - 10https://gerrit.wikimedia.org/r/753630 (owner: 10Jbond) [19:12:07] (03Abandoned) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [19:18:35] (03PS6) 10Jbond: P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [19:21:11] PROBLEM - Check systemd state on planet1003 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-cs.service,planet-update-de.service,planet-update-en.service,planet-update-es.service,planet-update-fr.service,planet-update-gmq.service,planet-update-it.service,planet-update-pl.service,planet-update-sq.service,planet-update-uk.service,planet-update-zh.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd [19:22:01] (03PS7) 10Jbond: P:base::production: move system::role to profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/765257 [19:23:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/749/con" [puppet] - 10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [19:23:59] (ProbeDown) firing: (2) Service planet1003:443 has failed probes (http_en_planet_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:38] (03CR) 10Jbond: [V: 03+1] "let me know what you think of this" [puppet] - 
10https://gerrit.wikimedia.org/r/765257 (owner: 10Jbond) [19:26:00] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [19:26:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [19:26:02] (03Abandoned) 10Jbond: DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 (owner: 10Jbond) [19:26:02] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [19:26:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [19:28:33] (03PS5) 10Jbond: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) [19:29:31] (03CR) 10Jbond: "@moritz i can;t remember what we decided about this in the meting but looks like we have something pretty much already working" [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [19:30:02] (03Abandoned) 10Jbond: C:monitoring::check::http: move config to config ini file [puppet] - 10https://gerrit.wikimedia.org/r/786384 (owner: 10Jbond) [19:31:03] (03CR) 10CI reject: [V: 04-1] P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [19:31:53] (03PS1) 10CDanis: [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) [19:32:40] (03CR) 10Jbond: "@Emperor let me know if this is useful or should i just abandon this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/773794 (owner: 10Jbond) [19:32:51] (03PS2) 10CDanis: [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) [19:35:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED [19:37:39] (03PS2) 10Jbond: R:envoyproxy::tls_terminator: Add support for ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/771942 [19:38:57] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED [19:38:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [19:39:55] (03CR) 10CI reject: [V: 04-1] R:envoyproxy::tls_terminator: Add support for ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/771942 (owner: 10Jbond) [19:41:19] (03PS3) 10Jbond: R:envoyproxy::tls_terminator: Add support for ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/771942 (https://phabricator.wikimedia.org/T303820) [19:42:19] (03CR) 10Jbond: "let me know if you think this is worth pursuing if if i should abandon" [puppet] - 10https://gerrit.wikimedia.org/r/771942 (https://phabricator.wikimedia.org/T303820) (owner: 10Jbond) [19:43:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [19:43:31] !log jclark@cumin1001 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [19:44:56] (03PS2) 10Jbond: P:scap::dsh: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/771853 [19:47:06] (03CR) 10Bartosz Dziewoński: [C: 03+1] mobile: Remove $wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977791 (owner: 10Gergő Tisza) [19:47:15] (03PS3) 10Jbond: P:scap::dsh: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/771853 [19:49:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1166.mgmt.eqiad.wmnet with reboot policy FORCED [19:49:25] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [19:49:31] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [19:49:38] (03Abandoned) 10Jbond: C:varnish: drop carries netmapper config [puppet] - 10https://gerrit.wikimedia.org/r/769927 (owner: 10Jbond) [19:49:47] (03CR) 10Jbond: [C: 03+2] P:scap::dsh: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/771853 (owner: 10Jbond) [19:50:16] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:21] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:48] (03Abandoned) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769448 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [19:51:44] (03PS3) 10Jbond: firmware fact: drop 
firmware_bios [puppet] - 10https://gerrit.wikimedia.org/r/765574 [19:52:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1168.mgmt.eqiad.wmnet with reboot policy FORCED [19:52:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1167.mgmt.eqiad.wmnet with reboot policy FORCED [19:52:25] (03PS1) 10Bking: wdqs: add CNAME for wdqs-ldf endpoint [dns] - 10https://gerrit.wikimedia.org/r/978131 (https://phabricator.wikimedia.org/T352111) [19:52:28] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED [19:52:28] (03CR) 10Jbond: "@jesse is this worth resurrecting?" [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [19:52:29] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED [19:53:45] (03Abandoned) 10Jbond: pontoon: add profile::base::pontoon to list of classes [puppet] - 10https://gerrit.wikimedia.org/r/767533 (owner: 10Jbond) [19:53:47] (03CR) 10Bking: "check experimental" [dns] - 10https://gerrit.wikimedia.org/r/978131 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [19:54:23] (03Abandoned) 10Jbond: puppetdbquery: remove module [puppet] - 10https://gerrit.wikimedia.org/r/792117 (owner: 10Jbond) [19:55:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [19:55:23] (03CR) 10Jbond: [V: 03+1] "@Emperor is this of any use?" 
[puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [19:57:02] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [19:57:07] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1165.mgmt.eqiad.wmnet with reboot policy FORCED [19:57:18] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/690366 (owner: 10Jbond) [19:57:37] (03Abandoned) 10Jbond: wmflib::configparser_format: Replace legacy function with puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793064 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [19:57:47] (03PS1) 10Jcrespo: Migrate TLS configuration to separate file and prepare for puppet call [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978133 (https://phabricator.wikimedia.org/T327157) [19:57:56] (03Abandoned) 10Jbond: C:graphite: Drop configparser_function in favour of wmflib::ini [puppet] - 10https://gerrit.wikimedia.org/r/793065 (owner: 10Jbond) [19:58:59] (03PS2) 10Jcrespo: Migrate TLS configuration to separate file and prepare for puppet call [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978133 (https://phabricator.wikimedia.org/T327157) [19:59:17] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: add CNAME for wdqs-ldf endpoint [dns] - 10https://gerrit.wikimedia.org/r/978131 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [19:59:19] (03Abandoned) 10Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [19:59:50] (03PS2) 10Jbond: P:trafficserver::backend: netbox-next switch to netbox-next.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/797320 (https://phabricator.wikimedia.org/T296452) [20:00:01] (03CR) 10Jbond: "ready to review" [puppet] - 
10https://gerrit.wikimedia.org/r/797320 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [20:00:03] (03CR) 10Bking: [C: 03+2] wdqs: add CNAME for wdqs-ldf endpoint [dns] - 10https://gerrit.wikimedia.org/r/978131 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:00:06] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/734999 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:00:59] (03Abandoned) 10Jbond: P:sretest: DO NOT MERGE - test prometheus::blackbox::check::http define [puppet] - 10https://gerrit.wikimedia.org/r/802737 (owner: 10Jbond) [20:01:17] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/735015 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:01:46] (03Abandoned) 10Jbond: hiera_export: add unmanaged (mostly) network devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792644 (owner: 10Jbond) [20:02:12] (03Abandoned) 10Jbond: hieradata: decommission netbox servers [puppet] - 10https://gerrit.wikimedia.org/r/803489 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [20:02:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1170.mgmt.eqiad.wmnet with reboot policy FORCED [20:02:28] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [20:02:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1171.mgmt.eqiad.wmnet with reboot policy FORCED [20:02:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1169.mgmt.eqiad.wmnet with reboot policy FORCED [20:03:14] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/735021 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:03:18] (03Abandoned) 10Jbond: P:netbox: Add hosts 
entry for service address [puppet] - 10https://gerrit.wikimedia.org/r/803508 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [20:04:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1172.mgmt.eqiad.wmnet with reboot policy FORCED [20:04:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1173.mgmt.eqiad.wmnet with reboot policy FORCED [20:04:56] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/735005 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:06:14] (03CR) 10JHathaway: wmflib: add ord function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736212 (owner: 10Jbond) [20:07:24] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye [20:07:33] (03CR) 10Ryan Kemper: miscweb: resolve wdqs ldf endpoint to wdqs1015 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [20:07:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye [20:07:53] 10SRE, 10Wikimedia-Incident: 503 Service Unavailable (all Wikimedia sites) - https://phabricator.wikimedia.org/T352182 (10DannyS712) >>! In T352182#9364058, @CDanis wrote: > Users are no longer impacted. > > Unfortunately due to the nature of the outage (a DDoS) we can't publish more details. For those with... 
[20:09:41] (03PS1) 10Subramanya Sastry: Work around Parsoid's messy handling of some extensions [extensions/DiscussionTools] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978067 (https://phabricator.wikimedia.org/T351461) [20:09:45] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:09:49] (03PS1) 10Bking: query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) [20:10:02] (03PS3) 10DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) [20:10:17] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [20:12:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on planet1003.eqiad.wmnet with reason: maintenance [20:12:54] (03CR) 10JHathaway: "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [20:13:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on planet1003.eqiad.wmnet with reason: maintenance [20:13:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on planet2003.codfw.wmnet with reason: maintenance [20:13:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on planet2003.codfw.wmnet with reason: maintenance [20:14:52] (03PS2) 10DDesouza: Increase coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973876 (https://phabricator.wikimedia.org/T344393) [20:16:01] (03CR) 10C. 
Scott Ananian: [C: 03+1] "Worth backporting to unblock Parsoid visual diff testing" [extensions/DiscussionTools] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978067 (https://phabricator.wikimedia.org/T351461) (owner: 10Subramanya Sastry) [20:16:06] (03CR) 10Dzahn: [C: 03+2] aphlict: move 'puppet-controlled config' to role-level [puppet] - 10https://gerrit.wikimedia.org/r/974668 (owner: 10Dzahn) [20:16:32] (03PS2) 10Bking: query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) [20:17:49] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:21:19] (03PS3) 10DDesouza: Increase coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973876 (https://phabricator.wikimedia.org/T344393) [20:21:50] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1161.eqiad.wmnet with OS bullseye [20:21:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [20:23:32] jouncebot: nowandnext [20:23:32] For the next 0 hour(s) and 36 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T1900) [20:23:33] In 0 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T2100) [20:24:25] (03CR) 10Btullis: [C: 03+1] "Looks good to me too. Can we wait until tomorrow before merging please, just so I can check any side effects?" 
[puppet] - 10https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:24:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:25:14] (03PS2) 10Ladsgroup: Disable VipsScaler in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) [20:25:19] (03CR) 10Ladsgroup: [C: 03+2] Disable VipsScaler in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) (owner: 10Ladsgroup) [20:26:08] (03PS1) 10C. Scott Ananian: DefaultOutputTransform::deduplicateStyles: don't match inside an attribute [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978068 [20:26:19] (03PS3) 10Bking: query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) [20:26:24] (03PS1) 10Jforrester: [BETA CLUSTER] Set default value for wmgCentralAuthCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978137 (https://phabricator.wikimedia.org/T352210) [20:26:30] (03CR) 10C. Scott Ananian: "Four-character fix helps unblock Parsoid visual diff testing" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978068 (owner: 10C. 
Scott Ananian) [20:26:35] (03CR) 10Ladsgroup: [C: 03+2] Disable VipsScaler in group0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) (owner: 10Ladsgroup) [20:26:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) (owner: 10Ladsgroup) [20:27:00] (03CR) 10Subramanya Sastry: [C: 03+1] DefaultOutputTransform::deduplicateStyles: don't match inside an attribute [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978068 (owner: 10C. Scott Ananian) [20:27:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:27:52] (03PS1) 10Papaul: Add new restbase nodes [puppet] - 10https://gerrit.wikimedia.org/r/978138 (https://phabricator.wikimedia.org/T349758) [20:28:43] (03Merged) 10jenkins-bot: Disable VipsScaler in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977250 (https://phabricator.wikimedia.org/T290759) (owner: 10Ladsgroup) [20:29:04] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:977250|Disable VipsScaler in group0 (T290759)]] [20:29:10] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:29:27] (03CR) 10CI reject: [V: 04-1] [BETA CLUSTER] Set default value for wmgCentralAuthCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978137 (https://phabricator.wikimedia.org/T352210) (owner: 10Jforrester) [20:30:28] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:977250|Disable VipsScaler in group0 (T290759)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:31:53] (03PS2) 10Jforrester: [BETA CLUSTER] Set default value for wmgCentralAuthCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978137 
(https://phabricator.wikimedia.org/T352210) [20:32:23] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [20:34:05] (03PS1) 10Bking: WIP: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/978140 (https://phabricator.wikimedia.org/T352111) [20:34:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978140 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:36:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-worker1158.eqiad.wmnet with OS bullseye [20:38:08] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::base::puppetmaster::frontend: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971447 (owner: 10Muehlenhoff) [20:39:13] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:977250|Disable VipsScaler in group0 (T290759)]] (duration: 10m 08s) [20:39:17] (03Abandoned) 10Bking: WIP: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/978140 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:39:18] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:40:25] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [20:41:02] (03PS4) 10Ryan Kemper: query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:41:58] (03CR) 10Ryan Kemper: query_service: point wdqs ldf endpoint to new CNAME (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:43:52] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10Ladsgroup) What should we do to move this forward? 
[20:44:46] (03CR) 10Papaul: [C: 03+2] Add new restbase nodes [puppet] - 10https://gerrit.wikimedia.org/r/978138 (https://phabricator.wikimedia.org/T349758) (owner: 10Papaul) [20:46:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye [20:46:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye [20:48:55] (03CR) 10JHathaway: (WIP) bolt: Add bolt rake tasks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765342 (owner: 10Jbond) [20:48:59] (03CR) 10Dzahn: query_service: point wdqs ldf endpoint to new CNAME (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [20:50:11] (03PS2) 10Dzahn: hieradata: delete puppet7 hiera keys for planet hosts [puppet] - 10https://gerrit.wikimedia.org/r/977798 [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231128T2100). Please do the needful. [21:00:05] danisztls and subbu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:18] o/ [21:00:43] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1158.eqiad.wmnet with reason: host reimage [21:00:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [21:01:16] (03PS1) 10Bking: The wdqs ldf endpoint (query.wikidata.org/bigdata/ldf) is hosted from a single server. 
Create a CNAME under discovery services so we don't have to update multiple places (monitoring, ATS, etc) when we update hosts. [dns] - 10https://gerrit.wikimedia.org/r/978142 (https://phabricator.wikimedia.org/T347355) [21:02:12] (03CR) 10Dzahn: [C: 03+1] The wdqs ldf endpoint (query.wikidata.org/bigdata/ldf) is hosted from a single server. Create a CNAME under discovery services so we don't h [dns] - 10https://gerrit.wikimedia.org/r/978142 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [21:02:19] Waving, but not at laptop ATM. Can deploy in a few if no one gets here. [21:02:55] (03CR) 10Bking: [C: 03+2] The wdqs ldf endpoint (query.wikidata.org/bigdata/ldf) is hosted from a single server. Create a CNAME under discovery services so we don't h [dns] - 10https://gerrit.wikimedia.org/r/978142 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [21:03:17] looks like 2 patches from danisztls and 2 from our team (subbu and cscott) [21:03:37] o/ [21:03:41] sry, internet issues [21:04:04] cscott: yup. And yours are backports. Merging now, should merge by time I'm at my laptop. [21:04:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1158.eqiad.wmnet with reason: host reimage [21:05:08] (03CR) 10Urbanecm: [C: 03+2] DefaultOutputTransform::deduplicateStyles: don't match inside an attribute [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978068 (owner: 10C. 
Scott Ananian) [21:05:12] (03CR) 10Urbanecm: [C: 03+2] Work around Parsoid's messy handling of some extensions [extensions/DiscussionTools] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978067 (https://phabricator.wikimedia.org/T351461) (owner: 10Subramanya Sastry) [21:07:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2029.codfw.wmnet with OS bullseye [21:07:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye executed wit... [21:08:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2029.codfw.wmnet with OS bullseye [21:08:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye [21:12:42] (03PS4) 10DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) [21:12:51] (03PS4) 10DDesouza: Increase coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973876 (https://phabricator.wikimedia.org/T344393) [21:16:08] (03CR) 10Bartosz Dziewoński: [C: 03+1] "That will probably fix it, thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978137 (https://phabricator.wikimedia.org/T352210) (owner: 10Jforrester) [21:16:45] (temporarily going offline for a minute ... 
reconnecting with a different wifi connection) [21:17:27] back [21:18:39] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:19:26] Hello. Added something to current window just a few minutes ago. [21:20:37] subbu: please leave T976844 for later, it might be postponed [21:21:23] urbanecm, ^^ from danisztls [21:21:34] ty. [21:21:39] i'm at my laptop now, sorry for the delay [21:21:54] danisztls, i am not the deployer, urbanecm is. [21:22:08] :) [21:22:09] sry [21:22:11] although feel free to if you want to :) [21:22:40] danisztls: what do you mean by T976844 though? https://phabricator.wikimedia.org/T976844 gives a 404 to me. [21:23:08] or you mean https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976844/? [21:23:18] ignore the T ;) [21:23:25] urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976844 [21:23:45] okay. so...you only need me to deploy the increase coverage patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/973876/), right? [21:24:17] urbanecm: yes [21:24:26] (03CR) 10Urbanecm: [C: 03+2] Increase coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973876 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:24:29] let's do it [21:24:35] (03Merged) 10jenkins-bot: DefaultOutputTransform::deduplicateStyles: don't match inside an attribute [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978068 (owner: 10C. Scott Ananian) [21:24:38] (03CR) 10CI reject: [V: 04-1] Work around Parsoid's messy handling of some extensions [extensions/DiscussionTools] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978067 (https://phabricator.wikimedia.org/T351461) (owner: 10Subramanya Sastry) [21:25:01] ... 
[21:25:06] that's not nice of you, jenkins [21:25:21] jerkins gonna jerk [21:25:28] (03Merged) 10jenkins-bot: Increase coverage of Reader Demographics 2 surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973876 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:25:34] i thought we abandoned that feature :) [21:25:48] https://integration.wikimedia.org/ci/job/mwext-php74-phan-docker/82992/console is a weird-ish failure. subbu, can you take a look please? [21:26:33] (03PS1) 10Jdlrobson: Fixes: Duplicate events for radio buttons [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978071 (https://phabricator.wikimedia.org/T352075) [21:26:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2029.codfw.wmnet with reason: host reimage [21:26:57] "castor-save-workspace-cache aborted." sounds like a full disk somewhere. and in post-build so not really a test failure. [21:27:41] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973876|Increase coverage of Reader Demographics 2 surveys (T344393)]], [[gerrit:978068|DefaultOutputTransform::deduplicateStyles: don't match inside an attribute]] [21:27:52] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:28:54] urbanecm, ok, looking. [21:29:04] !log urbanecm@deploy2002 cscott and urbanecm and dani: Backport for [[gerrit:973876|Increase coverage of Reader Demographics 2 surveys (T344393)]], [[gerrit:978068|DefaultOutputTransform::deduplicateStyles: don't match inside an attribute]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:11] or maybe just restart, full disk sounds plausible [21:29:33] subbu: cscott: danisztls: first backport and the coverage increase are at mwdebug2001, can you take a look? 
[21:29:45] urbanecm: yes [21:30:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2029.codfw.wmnet with reason: host reimage [21:30:08] urbanecm, ya .. retry? [21:30:20] (03CR) 10Urbanecm: [C: 03+2] Work around Parsoid's messy handling of some extensions [extensions/DiscussionTools] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978067 (https://phabricator.wikimedia.org/T351461) (owner: 10Subramanya Sastry) [21:30:23] let's see [21:30:57] integration-castor05 seems to have lots of free space generally, but I don't really know anything about the service it provides [21:31:43] cscott, urbanecm .. testing the core patch on mwdebug* and see if it fixes the issue. [21:31:52] ty [21:32:08] urbanecm: looks good [21:32:11] ack [21:33:52] cscott, that patch doesn't seem to fix it ... i tried action=purge a couple times now. [21:34:36] Last backport I was having trouble getting action=purge to work with X-Wikimedia-Debug; the parse after the purge wasn't reliably running on the canary machine. [21:35:44] Assuming that you can confirm that we're not breaking correct legacy usage of style deduplication, i'd let the deploy run all the way through then try again. [21:35:55] if it's not making things worse, i can sync it out and we can see in prod to confirm. [21:37:13] (03Merged) 10jenkins-bot: Work around Parsoid's messy handling of some extensions [extensions/DiscussionTools] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978067 (https://phabricator.wikimedia.org/T351461) (owner: 10Subramanya Sastry) [21:37:33] cscott, why don't you try this on your end and see if you have better luck, just in case? [21:37:42] sure [21:37:54] the discussiontools merged, so failure was transient at least [21:38:03] https://en.wikipedia.org/wiki/Ōpunake and https://en.wikipedia.org/wiki/Gospel_Oak_railway_station are couple pages you could try [21:38:23] kimberly_sarabia: just saw your additions to the backport window. 
are you around for deploying them? [21:38:27] with "?useparsoid=1" you will see that long citation superscript because of the breakage if it isn't fixed. [21:38:27] yes [21:38:38] ack, let's merge them [21:38:43] thanks [21:38:56] (03PS1) 10Urbanecm: Fixes: Duplicate events for radio buttons [skins/Vector] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/978072 (https://phabricator.wikimedia.org/T352075) [21:39:02] (03CR) 10Urbanecm: [C: 03+2] Fixes: Duplicate events for radio buttons [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978071 (https://phabricator.wikimedia.org/T352075) (owner: 10Jdlrobson) [21:39:10] (03CR) 10Urbanecm: [C: 03+2] Fixes: Duplicate events for radio buttons [skins/Vector] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/978072 (https://phabricator.wikimedia.org/T352075) (owner: 10Urbanecm) [21:40:07] (03CR) 10Dzahn: [C: 03+2] hieradata: delete puppet7 hiera keys for planet hosts [puppet] - 10https://gerrit.wikimedia.org/r/977798 (owner: 10Dzahn) [21:42:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1161.eqiad.wmnet with OS bullseye [21:42:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [21:43:43] cscott: how is it looking please? :) [21:43:55] sorry, working on it still [21:44:05] (03PS1) 10Dzahn: Revert "hieradata: delete puppet7 hiera keys for planet hosts" [puppet] - 10https://gerrit.wikimedia.org/r/978073 [21:44:25] (03CR) 10Dzahn: [C: 03+2] Revert "hieradata: delete puppet7 hiera keys for planet hosts" [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:44:43] no worries, just checking. waiting. 
[21:47:01] (03CR) 10Muehlenhoff: "The role can only be converted when the old Buster VMs are decommed, we can't provide the Puppet 7 agent for Buster due to the old Ruby." [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:47:37] (03CR) 10Dzahn: [C: 03+2] "yea, that's why I am reverting. It's already messed up though.. hrmm" [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:47:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:48:32] subbu: the reason your test didn't work is that Gospel Oak railway station is on enwiki and wmf.7 isn't on enwiki yet [21:49:01] aaah!! [21:49:17] (03CR) 10Dzahn: [C: 03+2] "since I have unrelated issues with the new python code that fetches the feeds I might go back to insetup and reimage" [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:49:24] that makes a lot of sense [21:49:27] so, ok to proceed? [21:49:46] (03CR) 10Muehlenhoff: Revert "hieradata: delete puppet7 hiera keys for planet hosts" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:49:49] give me 1 minute more, i've almost got a repo built on mediawiki org which should be on wmf.7 [21:49:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:49:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2029.codfw.wmnet with OS bullseye [21:50:00] that will help me when i go trying to test the next fix .. i had an enwiki patch for the next one as well. 
[21:50:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2029.codfw.wmnet with OS bullseye completed: - restbase2029 (**PASS*... [21:50:09] (03CR) 10Dzahn: [C: 03+2] Revert "hieradata: delete puppet7 hiera keys for planet hosts" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:50:15] (next fix = discussion tools) .. i would have been similarly frustrated. [21:50:44] heh [21:50:48] cscott: ack, waiting [21:51:24] (03CR) 10Dzahn: [C: 03+2] "I still have the exact changes it made in /etc/puppet/puppet.conf in front of me. I could just manually change that back and run puppet." [puppet] - 10https://gerrit.wikimedia.org/r/978073 (owner: 10Dzahn) [21:51:32] ok, i confirmed that deduplication isn't broken on https://www.mediawiki.org/wiki/User:Cscott/TemplateStyles_Test for both useparsoid=0 and useparsoid=1 [21:52:13] (i got a testwiki test for my dt fix). [21:52:15] yay [21:52:18] i haven't fully reproduced the "deduplicated style element in quoted data-mw attribute" test case yet, but at least this patch doesn't regress anything so i think it's safe to proceed [21:52:23] ok, proceeding [21:52:24] !log urbanecm@deploy2002 cscott and urbanecm and dani: Continuing with sync [21:53:29] cscott, so i suppose we really should have backported this to wmf.5 ... live and learn! [21:53:41] same with discussion tools .. .but i can wait till thursday to run the full test. 
[21:54:11] (03PS5) 10Bking: query_service: point wdqs ldf endpoint to new CNAME [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) [21:54:15] (03CR) 10Bking: query_service: point wdqs ldf endpoint to new CNAME (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [21:55:55] (03Merged) 10jenkins-bot: Fixes: Duplicate events for radio buttons [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/978071 (https://phabricator.wikimedia.org/T352075) (owner: 10Jdlrobson) [21:57:09] (03Merged) 10jenkins-bot: Fixes: Duplicate events for radio buttons [skins/Vector] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/978072 (https://phabricator.wikimedia.org/T352075) (owner: 10Urbanecm) [21:57:20] patches merged, waiting on previous sync to complete [21:58:50] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973876|Increase coverage of Reader Demographics 2 surveys (T344393)]], [[gerrit:978068|DefaultOutputTransform::deduplicateStyles: don't match inside an attribute]] (duration: 31m 09s) [21:58:55] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:59:11] subbu: https://www.mediawiki.org/wiki/User:Cscott/TemplateStyles_Test?useparsoid=1 is a good test case, I think, and it doesn't show any problems on the canary [21:59:13] subbu: cscott: core backport synced [21:59:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978134 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [21:59:26] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:978072|Fixes: Duplicate events for radio buttons (T352075)]], [[gerrit:978071|Fixes: Duplicate events for radio buttons (T352075)]], [[gerrit:978067|Work around Parsoid's messy handling of some extensions (T351461)]] [21:59:30] danisztls: and your config change too :) [21:59:33] T352075: Duplicate 
Events Generated when Interacting with Radio Buttons - https://phabricator.wikimedia.org/T352075 [21:59:33] T351461: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads - https://phabricator.wikimedia.org/T351461 [21:59:36] urbanecm thanks! [21:59:40] np [21:59:55] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977725 [21:59:58] thanks. we'll have to create a test case that duplicates the enwiki breakage and see if that is fixed .. but we can do that separately after. [22:00:07] urbanecm: thanks! [22:01:04] !log urbanecm@deploy2002 urbanecm and ssastry and jdlrobson: Backport for [[gerrit:978072|Fixes: Duplicate events for radio buttons (T352075)]], [[gerrit:978071|Fixes: Duplicate events for radio buttons (T352075)]], [[gerrit:978067|Work around Parsoid's messy handling of some extensions (T351461)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:01:26] kimberly_sarabia: cscott: subbu: can you test your patches at mwdebug please? [22:01:33] Yes one moment [22:01:35] ty [22:01:35] will do. [22:01:49] subbu: I think https://www.mediawiki.org/wiki/User:Cscott/TemplateStyles_Test?useparsoid=1 is that test case.  I should be able to copy it onto enwiki and show that it is still broken then. [22:03:26] urbanecm, my discussion tools patch works! .. okay to sync https://test.wikipedia.org/wiki/Wikipedia_talk:What_Test_Wiki_is_not?useparsoid=1 renders fine now. [22:03:30] yay [22:04:02] cscott saved me some time and frustration by pointing out the wmf.5 / wmf.7 issue :) it would have taken me a while before I noticed that! 
[22:04:34] LGTM [22:04:36] thanks [22:04:37] ty [22:04:38] !log urbanecm@deploy2002 urbanecm and ssastry and jdlrobson: Continuing with sync [22:04:39] proceeding [22:04:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [22:08:45] subbu: ok with looking at the wikifeeds stuff as well? [22:09:52] sure ... let me read up the notes there first. [22:09:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2031.codfw.wmnet with OS bullseye [22:10:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2031.codfw.wmnet with OS bullseye [22:10:10] k. waiting on the backport to finish anyway. [22:10:12] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Set default value for wmgCentralAuthCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978137 (https://phabricator.wikimedia.org/T352210) (owner: 10Jforrester) [22:10:50] urbanecm, okay .. i suppose with wikifeeds k8s .. it goes live everywhere since it is a nodejs service (like it was with parsoid back in the day) .. correct? [22:10:54] (03Merged) 10jenkins-bot: [BETA CLUSTER] Set default value for wmgCentralAuthCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978137 (https://phabricator.wikimedia.org/T352210) (owner: 10Jforrester) [22:11:57] subbu: sort of. 
it reaches staging first, so i can test it with curl (and/or you can if you ssh in) [22:12:29] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:978072|Fixes: Duplicate events for radio buttons (T352075)]], [[gerrit:978071|Fixes: Duplicate events for radio buttons (T352075)]], [[gerrit:978067|Work around Parsoid's messy handling of some extensions (T351461)]] (duration: 13m 02s) [22:12:38] T352075: Duplicate Events Generated when Interacting with Radio Buttons - https://phabricator.wikimedia.org/T352075 [22:12:38] T351461: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads - https://phabricator.wikimedia.org/T351461 [22:12:56] urbanecm, ssh to deploy1002 ? [22:13:09] yup. gimme a moment though. [22:13:14] ok. [22:16:42] (03CR) 10Urbanecm: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977725 (owner: 10PipelineBot) [22:17:31] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977725 (owner: 10PipelineBot) [22:19:59] !log urbanecm@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [22:20:28] !log urbanecm@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [22:21:29] subbu: okay, so `curl https://staging.svc.eqiad.wmnet:4101/en.wikipedia.org/v1/feed/announcements | jq . | less` at deploy2002 now has the updated messaging [22:21:33] output lgtm [22:21:55] wfm .. [22:22:13] !log urbanecm@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [22:22:18] okay [22:22:46] !log urbanecm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [22:22:53] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [22:23:17] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [22:23:41] subbu: and we're live [22:24:26] and looks good to me as well. 
[22:24:37] good :) [22:24:43] https://en.wikipedia.org/api/rest_v1/feed/announcements?x=3 (i added the ?x=3 to bypass caches) [22:25:13] i purged the cache and it works w/o x=3 too [22:27:05] https://wikitech.wikimedia.org/wiki/Deployment_server says deploy1002 is active but when I logged on, i got a big message saying deploy2002.codfw is the active one .. so, i'm going to update that wikipage [22:27:35] ty [22:27:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2031.codfw.wmnet with reason: host reimage [22:30:55] ty urbanecm ... going to sign off now for a bit .. being kicked out of this coffee shop! :) [22:31:18] heh. me too. not in a coffee shop tho. bye for now! [22:33:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2031.codfw.wmnet with reason: host reimage [22:33:05] !log cp4052 - disabling puppet to experiment on how we gather prometheus stats from ATS... [22:33:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2030.codfw.wmnet with OS bullseye [22:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2030.codfw.wmnet with OS bullseye executed with errors: - restbase20... 
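(editor's note on the 22:24:43 message: the `?x=3` trick works on the assumption that the edge cache keys on the full URL, so an unused query parameter yields a fresh, uncached response. Below is a minimal sketch of building such a URL; the helper name `with_cache_buster` and its defaults are hypothetical, chosen only to mirror the message, and are not part of any Wikimedia tooling.)

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url: str, param: str = "x", value: str = "3") -> str:
    """Return `url` with a throwaway query parameter appended.

    Hypothetical helper: assumes the cache in front of the service treats
    URLs with different query strings as different cache entries.
    """
    parts = urlsplit(url)
    # Keep the existing query parameters, minus any earlier copy of `param`.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_cache_buster(
        "https://en.wikipedia.org/api/rest_v1/feed/announcements"))
```

(changing `value` on each call gives a new URL every time; as the 22:25:13 message notes, once the cache is purged the plain URL returns the updated response as well.)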
[22:43:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:56] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:51:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:51:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:51:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2031.codfw.wmnet with OS bullseye [22:51:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2031.codfw.wmnet with OS bullseye completed: - restbase2031 (**PASS*... 
[22:51:52] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:51:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2030.codfw.wmnet with OS bullseye [22:51:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2030.codfw.wmnet with OS bullseye [22:52:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Papaul) [22:54:57] (JobUnavailable) firing: (5) Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:00:42] (SystemdUnitFailed) firing: envoyproxy.service Failed on wdqs2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:54] PROBLEM - WDQS SPARQL on wdqs2020 is CRITICAL: connect to address 10.192.0.85 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:01:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2032.codfw.wmnet with OS bullseye [23:02:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2032.codfw.wmnet with OS bullseye [23:03:00] PROBLEM - Check systemd state on wdqs2020 is CRITICAL: CRITICAL - 
degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:00] PROBLEM - Check systemd state on wdqs2016 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:44] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: connect to address 10.192.0.141 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:04:16] PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:18] PROBLEM - WDQS SPARQL on wdqs2016 is CRITICAL: connect to address 10.192.16.193 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:04:28] ^ We're aware of this and looking into it now [23:05:14] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:05:23] !log cp4052 - depool temporarily [23:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:42] (SystemdUnitFailed) firing: (4) envoyproxy.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:44] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:42] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:07:01] !log cp4052 - repool [23:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2030.codfw.wmnet with reason: host reimage [23:10:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:10:54] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:11:10] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2030.codfw.wmnet with reason: host reimage [23:13:51] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2033.codfw.wmnet with OS bullseye [23:14:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2033.codfw.wmnet with OS bullseye [23:14:35] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:15:46] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
jclark@cumin1001" [23:15:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1158.eqiad.wmnet with OS bullseye [23:15:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye completed: - an-worker1158 (**WA... [23:19:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2032.codfw.wmnet with reason: host reimage [23:22:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2032.codfw.wmnet with reason: host reimage [23:23:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:25:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1159.eqiad.wmnet with OS bullseye [23:25:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1159.eqiad.wmnet with OS bullseye [23:27:38] (03PS1) 10Bking: wdqs: Add reissued wdqs.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/978149 (https://phabricator.wikimedia.org/T352111) [23:28:22] (03PS2) 10Bking: wdqs: Add reissued wdqs.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/978149 (https://phabricator.wikimedia.org/T352111) [23:29:29] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: Add reissued wdqs.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/978149 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [23:29:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:30:08] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: Add reissued wdqs.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/978149 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [23:30:10] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: Add reissued wdqs.discovery.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/978149 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [23:31:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:31:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2030.codfw.wmnet with OS bullseye [23:31:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2030.codfw.wmnet with OS bullseye completed: - restbase2030 (**PASS*... 
[23:31:16] (03PS1) 10Papaul: Add new elastic nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/978150 (https://phabricator.wikimedia.org/T349778) [23:31:31] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:31:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage [23:31:45] (03CR) 10Krinkle: [C: 03+1] arclamp: redirect alerts to o11y [alerts] - 10https://gerrit.wikimedia.org/r/978063 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [23:32:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Papaul) [23:32:47] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:32:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2034.codfw.wmnet with OS bullseye [23:32:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2034.codfw.wmnet with OS bullseye [23:33:03] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:33:17] (03CR) 10Papaul: [C: 03+2] Add new elastic nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/978150 (https://phabricator.wikimedia.org/T349778) (owner: 10Papaul) [23:33:37] RECOVERY - Check systemd state on wdqs2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2033.codfw.wmnet with reason: host reimage 
[23:35:42] (SystemdUnitFailed) firing: (4) envoyproxy.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:36:25] (03CR) 10Andrew Bogott: "This change (or something similar) still seems necessary. See the broken state of abogott-puppet7.testlabs.eqiad1.wikimedia.cloud to see w" [puppet] - 10https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: 10Andrew Bogott) [23:36:33] RECOVERY - WDQS SPARQL on wdqs2020 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.375 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:36:43] RECOVERY - Check systemd state on wdqs2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:11] RECOVERY - Check systemd state on wdqs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:21] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.427 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:37:33] (03Abandoned) 10Andrew Bogott: Test patch to see if the linter is misbehaving. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/968762 (owner: 10Andrew Bogott) [23:38:05] (03Abandoned) 10Andrew Bogott: wmf_sink: catch ssl errors when talking to the proxy server [puppet] - 10https://gerrit.wikimedia.org/r/952919 (https://phabricator.wikimedia.org/T345103) (owner: 10Andrew Bogott) [23:38:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Papaul) [23:38:32] (03CR) 10Andrew Bogott: [C: 03+1] team-wmcs: improve host down alerts [alerts] - 10https://gerrit.wikimedia.org/r/977743 (https://phabricator.wikimedia.org/T352059) (owner: 10Majavah) [23:39:35] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1159.eqiad.wmnet with reason: host reimage [23:39:52] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:39:52] (03CR) 10Andrew Bogott: [C: 03+2] Add clean-stale-puppet-certs script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829321 (owner: 10Andrew Bogott) [23:40:43] (SystemdUnitFailed) resolved: (4) envoyproxy.service Failed on wdqs1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:41:26] (03CR) 10Andrew Bogott: [C: 03+1] P:openstack: nova: add script to run console commands [puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) (owner: 10Majavah) [23:41:35] RECOVERY - WDQS SPARQL on wdqs2016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:42:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:42:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2032.codfw.wmnet with OS bullseye [23:42:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2032.codfw.wmnet with OS bullseye completed: - restbase2032 (**PASS*... [23:42:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1160.eqiad.wmnet with OS bullseye [23:42:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1159.eqiad.wmnet with reason: host reimage [23:42:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye [23:43:46] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1161.eqiad.wmnet with OS bullseye [23:48:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1161.eqiad.wmnet with OS bullseye [23:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:50:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
restbase2034.codfw.wmnet with reason: host reimage [23:51:19] !log cp4052 - all back to normal [23:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:53:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2035.codfw.wmnet with OS bullseye [23:53:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2035.codfw.wmnet with OS bullseye [23:53:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:53:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2033.codfw.wmnet with OS bullseye [23:53:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2034.codfw.wmnet with reason: host reimage [23:53:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2033.codfw.wmnet with OS bullseye completed: - restbase2033 (**PASS*... 
[23:58:00] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:58:04] (03PS1) 10Andrew Bogott: wikitech: remove port 80 ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/978154 [23:58:51] (03CR) 10Andrew Bogott: "proposed alternative https://gerrit.wikimedia.org/r/c/operations/puppet/+/978154" [puppet] - 10https://gerrit.wikimedia.org/r/977170 (owner: 10Muehlenhoff)