[00:00:10] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:59] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:35] (CR) Andrea Denisse: "Hello, here are the PCC results for the latest patch: https://puppet-compiler.wmflabs.org/output/909738/40869/" [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [00:13:52] (CR) Andrea Denisse: prometheus: Add support for syncing data between Prometheus hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [00:21:04] (CR) Aaron Schulz: webperf: enable libapache2-mod-php7.4 on profile::webperf::site (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle) [00:31:06] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911387 (owner: TrainBranchBot) [00:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:39:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911840 [00:39:28] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911840 (owner: TrainBranchBot) [00:42:15] RECOVERY - Check systemd state on 
logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/911840 (owner: TrainBranchBot) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:17] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:21:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:36:01] ops-ulsfo, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder) [02:41:47] ops-ulsfo, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder) 
[02:55:47] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder) [02:56:01] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder) [02:56:03] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder) [03:00:10] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:47] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder) [03:16:01] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder) [03:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [04:27:53] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:19:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 59360 [05:19:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 59360 [05:19:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 59360 [05:19:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 59360 [05:20:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 59369 
[05:21:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 59369 [05:33:07] !log bounce SGIX RS BGP - T327284 [05:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:44] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (Marostegui) [05:33:51] (CR) Marostegui: [C: +2] data.yaml: Add Panagiotis Penloglou [puppet] - https://gerrit.wikimedia.org/r/911877 (https://phabricator.wikimedia.org/T335353) (owner: Marostegui) [05:35:56] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (Marostegui) Open→Resolved a:Marostegui The patch giving access has been merged. Please allow 30 minutes for the change to run across... [05:43:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 703 [05:44:06] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 703 [05:44:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 703 [05:45:17] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 703 [05:46:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 703 [05:46:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 703 [05:50:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45430 [05:51:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45430 [05:51:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2519 [05:52:30] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2519 [05:52:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55967 [05:53:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55967 [05:53:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2518 [05:53:46] (PS1) Marostegui: wmnet: Update parsercache CNAME [dns] - https://gerrit.wikimedia.org/r/912171 (https://phabricator.wikimedia.org/T327920) [05:54:12] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Marostegui) [05:54:25] (CR) Marostegui: [C: -2] "Wait for eqiad to be active" [dns] - https://gerrit.wikimedia.org/r/912171 (https://phabricator.wikimedia.org/T327920) (owner: Marostegui) [05:54:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2518 [05:54:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 131285 [05:55:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 131285 [05:56:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 63199 [05:57:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 [05:57:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38158 [05:58:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38158 [05:58:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 10030 [05:59:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for 
AS: 10030 [05:59:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 139836 [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T0600) [06:00:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 139836 [06:01:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9584 [06:01:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9584 [06:01:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 906 [06:02:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 906 [06:02:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 10089 [06:03:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 10089 [06:04:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35280 [06:05:29] ACKNOWLEDGEMENT - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 803280 MB (3% inode=99%): Marostegui https://phabricator.wikimedia.org/T330573 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [06:06:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35280 [06:06:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136106 [06:07:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136106 [06:07:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 18403 [06:08:00] 
!log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18403 [06:08:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23824 [06:09:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 23824 [06:09:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 199524 [06:10:18] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [06:10:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 [06:10:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4775 [06:11:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4775 [06:11:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136907 [06:12:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136907 [06:12:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6939 [06:12:45] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'email' for AS: 6939 [06:13:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 49544 [06:14:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 49544 [06:14:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4761 [06:14:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4761 [06:14:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 
'email' for AS: 140951 [06:15:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 140951 [06:15:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 133840 [06:15:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 133840 [06:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:16:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4773 [06:17:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4773 [06:17:17] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9009 [06:18:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9009 [06:18:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55818 [06:19:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55818 [06:19:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 54994 [06:20:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 54994 [06:20:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 17961 [06:20:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17961 [06:20:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23947 [06:21:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) 
with action 'email' for AS: 23947 [06:21:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58552 [06:23:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58552 [06:23:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 132132 [06:24:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 132132 [06:24:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4651 [06:25:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4651 [06:25:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38040 [06:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:27:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38040 [06:27:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8529 [06:29:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8529 [06:29:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9299 [06:30:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9299 [06:31:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23951 [06:32:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' 
for AS: 23951 [06:32:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9002 [06:32:21] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:32:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9002 [06:32:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 137831 [06:33:11] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:33:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 137831 [06:34:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 24482 [06:35:42] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 24482 [06:35:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9583 [06:36:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9583 [06:36:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 134823 [06:37:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 134823 [06:37:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45498 [06:37:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45498 [06:38:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 17676 [06:38:38] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17676 [06:38:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 36351 [06:40:47] ops-ulsfo, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder) [06:40:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36351 [06:40:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1239 [06:42:02] ops-ulsfo, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (phaultfinder) [06:42:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1239 [06:42:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4657 [06:43:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4657 [06:44:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38082 [06:45:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38082 [06:45:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1828 [06:47:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1828 [06:48:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 140407 [06:48:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 140407 [06:49:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45796 [06:49:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45796 [06:49:58] !log ayounsi@cumin1001 START - Cookbook 
sre.network.peering with action 'email' for AS: 7552 [06:50:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7552 [06:51:18] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 18106 [06:53:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18106 [06:53:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4826 [06:55:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4826 [06:56:01] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder) [06:56:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 133840 [06:57:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 133840 [06:59:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 49544 [07:00:05] Amir1, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [07:00:46] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (phaultfinder) [07:00:48] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (phaultfinder) [07:00:51] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [07:01:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 49544 [07:01:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 55818 [07:03:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 55818 [07:04:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4826 [07:06:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4826 [07:06:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 9584 [07:08:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 9584 [07:09:31] Hi, taavi, if you are around, I have beta-only config change that would be nice to see deployed. 
But if that is not possible, that is also totally fine, because I did not schedule it in time for the current window after all [07:09:43] this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/911311 [07:16:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 38082 [07:16:01] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder) [07:16:52] (CR) Jelto: "looks mostly good. One little comment in-line regarding the help test of the parameters." [cookbooks] - https://gerrit.wikimedia.org/r/911951 (owner: EoghanGaffney) [07:17:37] (CR) Jelto: [gitlab/failover] Rename host flags (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/911951 (owner: EoghanGaffney) [07:18:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 38082 [07:18:52] (PS3) Majavah: Beta-Wikidata: Enable Labels in Wikidata edit summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: Michael Große) [07:19:04] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: Michael Große) [07:19:49] (Merged) jenkins-bot: Beta-Wikidata: Enable Labels in Wikidata edit summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/911311 (https://phabricator.wikimedia.org/T327062) (owner: Michael Große) [07:19:58] Thank you! 
❤️ [07:20:46] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (phaultfinder) [07:20:51] !log taavi@deploy1002 Started scap: Backport for [[gerrit:911311|Beta-Wikidata: Enable Labels in Wikidata edit summaries (T327062)]] [07:20:57] T327062: Show entity labels in parsed edit summaries in API requests as well - https://phabricator.wikimedia.org/T327062 [07:21:35] MichaelG_WMDE: since it touches a non-labs file too, I'm going to need to sync it to prod [07:22:19] !log taavi@deploy1002 taavi and migr: Backport for [[gerrit:911311|Beta-Wikidata: Enable Labels in Wikidata edit summaries (T327062)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [07:22:49] wikidata still seems to be up, so I'll just sync ig [07:24:05] Ah gotcha. That is true and I understand, though I expect it to not affect prod at all. [07:26:53] (CR) Slyngshede: [V: +1 C: +2] IDM: Add placeholders for mediawiki OAuth [labs/private] - https://gerrit.wikimedia.org/r/911862 (owner: Slyngshede) [07:26:56] (CR) Slyngshede: [V: +2 C: +2] IDM: Add placeholders for mediawiki OAuth [labs/private] - https://gerrit.wikimedia.org/r/911862 (owner: Slyngshede) [07:27:51] (CR) Elukey: [C: +2] services: add kafka-logging100[12] to network rules and broker list [deployment-charts] - https://gerrit.wikimedia.org/r/911872 (https://phabricator.wikimedia.org/T326419) (owner: Herron) [07:28:11] I can see the changes on beta 👍 [07:28:40] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:911311|Beta-Wikidata: Enable Labels in Wikidata edit summaries (T327062)]] (duration: 07m 48s) [07:28:46] T327062: Show entity labels in parsed edit summaries in API requests as well - https://phabricator.wikimedia.org/T327062 [07:29:37] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (fgiunchedi) >>! 
In T334785#8802395, @wiki_willy wrote: > @RobH - can you work with @fgiunchedi on this? This ties back to T310266, when the alert was first rolled out. But if you're able to ssh in and it continues to alert, I'm think... [07:29:44] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40870/console" [puppet] - https://gerrit.wikimedia.org/r/911852 (owner: Slyngshede) [07:30:54] (PS1) Marostegui: site.pp: These hosts are bullseye [puppet] - https://gerrit.wikimedia.org/r/912227 [07:32:08] (PS2) Marostegui: site.pp: These hosts are bullseye [puppet] - https://gerrit.wikimedia.org/r/912227 [07:32:37] (CR) Marostegui: [C: +2] site.pp: These hosts are bullseye [puppet] - https://gerrit.wikimedia.org/r/912227 (owner: Marostegui) [07:32:57] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: sync [07:32:57] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: sync [07:33:50] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [07:33:54] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [07:34:46] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync [07:35:01] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [07:35:49] (CR) Jelto: [V: +1 C: +2] gitlab: add ferm rule for certbot on WMCS [puppet] - https://gerrit.wikimedia.org/r/911765 (https://phabricator.wikimedia.org/T335161) (owner: Jelto) [07:36:27] (PS1) Muehlenhoff: package_builder: Remove some outdated special case handling for adding sec repo [puppet] - https://gerrit.wikimedia.org/r/912230 [07:36:45] (CR) Elukey: [C: +2] "Keith everything deployed, you can proceed anytime :)" [deployment-charts] - 
https://gerrit.wikimedia.org/r/911872 (https://phabricator.wikimedia.org/T326419) (owner: Herron) [07:37:45] (CR) MVernon: [C: +2] swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/911779 (https://phabricator.wikimedia.org/T335278) (owner: MVernon) [07:39:22] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [07:39:27] !log start to load new swift backends, drain old ones T335278 T335279 T335280 T335281 [07:39:27] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [07:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:33] T335279: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 [07:39:33] T335278: Bring ms-be207[0-3] into the rings - https://phabricator.wikimedia.org/T335278 [07:39:33] T335281: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 [07:39:34] T335280: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 [07:41:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [07:41:34] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [07:43:32] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/911852 (owner: Slyngshede) [07:45:07] (PS2) Slyngshede: P:IDM Enable Wikimedia Global Account linking. 
[puppet] - 10https://gerrit.wikimedia.org/r/911852 [07:46:15] (03CR) 10Slyngshede: P:IDM Enable Wikimedia Global Account linking. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911852 (owner: 10Slyngshede) [07:47:24] (03CR) 10Slyngshede: [C: 03+2] P:IDM Enable Wikimedia Global Account linking. [puppet] - 10https://gerrit.wikimedia.org/r/911852 (owner: 10Slyngshede) [07:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [07:57:54] (03PS1) 10Muehlenhoff: Temporarily stop using udebs from unstable [puppet] - 10https://gerrit.wikimedia.org/r/912232 (https://phabricator.wikimedia.org/T330495) [08:00:03] (03CR) 10Muehlenhoff: [C: 03+2] Temporarily stop using udebs from unstable [puppet] - 10https://gerrit.wikimedia.org/r/912232 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [08:00:04] jeena and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T0800). [08:00:11] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [08:00:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest... 
[08:12:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:12:23] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [08:12:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [08:12:31] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [08:13:13] (03CR) 10FNegri: [C: 03+1] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [08:13:15] (03CR) 10FNegri: [C: 03+2] toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* [puppet] - 10https://gerrit.wikimedia.org/r/909397 (https://phabricator.wikimedia.org/T334925) (owner: 10BryanDavis) [08:13:43] (03PS5) 10Clément Goubert: sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) [08:14:21] (03CR) 10Clément Goubert: sre.discovery.datacenter: exclude services not in production (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [08:16:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:17:00] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - 
https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [08:17:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [08:17:11] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [08:17:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:17:53] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [08:17:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [08:18:05] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... 
[08:22:06] (03PS1) 10Urbanecm: dewiki: Deploy Growth features to 100% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) [08:22:37] (03CR) 10Urbanecm: [C: 04-2] "not yet; to be deployed on 2023-05-01" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) (owner: 10Urbanecm) [08:22:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:22:50] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [08:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:36:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [08:39:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [08:40:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32934 [08:41:26] (03CR) 10Clément Goubert: "This change is ready for review." 
[dns] - 10https://gerrit.wikimedia.org/r/912235 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [08:41:46] (03CR) 10Kosta Harlan: [C: 03+1] dewiki: Deploy Growth features to 100% of newcomers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) (owner: 10Urbanecm) [08:42:10] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [08:42:18] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10fgiunchedi) 05Open→03Resolved ACLs updated, and I'm optimistically resolving this task (and related to mgmt in PoPs) [08:42:22] (03PS1) 10Clément Goubert: Revert "debug.json: List primary DC servers first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911792 [08:42:38] (03PS1) 10Slyngshede: C:IDM Setup email notification. [puppet] - 10https://gerrit.wikimedia.org/r/912236 (https://phabricator.wikimedia.org/T320808) [08:42:58] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [08:44:00] 10ops-ulsfo, 10DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T335293 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving per https://phabricator.wikimedia.org/T309979#8806830 [08:44:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335298 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving per https://phabricator.wikimedia.org/T309979#8806830 [08:46:05] (03CR) 10Slyngshede: [C: 03+2] C:IDM Setup email notification. 
[puppet] - 10https://gerrit.wikimedia.org/r/912236 (https://phabricator.wikimedia.org/T320808) (owner: 10Slyngshede) [08:46:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [08:47:03] (03CR) 10Urbanecm: [C: 04-2] dewiki: Deploy Growth features to 100% of newcomers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) (owner: 10Urbanecm) [08:47:19] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add backup type failover [puppet] - 10https://gerrit.wikimedia.org/r/911759 (https://phabricator.wikimedia.org/T330771) (owner: 10Jelto) [08:47:31] (03CR) 10David Caro: [C: 03+1] "👍 awesome" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 (owner: 10BryanDavis) [08:51:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934 [08:51:43] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:12] (03CR) 10Slyngshede: "recheck" [software/bitu] - 10https://gerrit.wikimedia.org/r/909959 (owner: 10Slyngshede) [08:52:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [08:52:24] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [08:53:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [08:53:20] (03CR) 10Marostegui: [C: 03+1] "Checked the eqiad masters, all good." 
[dns] - 10https://gerrit.wikimedia.org/r/912235 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [08:53:20] !log installing golang-1.11 security updates [08:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:02] (03PS1) 10JMeybohm: Test yaml anchors in hiera [puppet] - 10https://gerrit.wikimedia.org/r/912238 [08:57:39] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:57:54] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Enable emailing for signup and password reset [software/bitu] - 10https://gerrit.wikimedia.org/r/909959 (owner: 10Slyngshede) [08:58:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove jessie and stretch image configuration [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 (owner: 10BryanDavis) [09:05:22] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396 [09:05:28] T335396: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 [09:05:35] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396 [09:06:14] (03PS1) 10Jelto: gitlab: add backup type failover to script output [puppet] - 10https://gerrit.wikimedia.org/r/912239 (https://phabricator.wikimedia.org/T330771) [09:06:50] (03PS1) 10Elukey: role::builder: add ml-runner user [puppet] - 10https://gerrit.wikimedia.org/r/912240 (https://phabricator.wikimedia.org/T333009) [09:07:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40876/console" [puppet] - 10https://gerrit.wikimedia.org/r/912240 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:08:25] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40875/console" [puppet] - 10https://gerrit.wikimedia.org/r/912239 (https://phabricator.wikimedia.org/T330771) (owner: 10Jelto) [09:09:42] (03CR) 10Ladsgroup: [C: 03+1] "Checked the new ones. They are correct" [dns] - 10https://gerrit.wikimedia.org/r/912171 (https://phabricator.wikimedia.org/T327920) (owner: 10Marostegui) [09:10:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40877/console" [puppet] - 10https://gerrit.wikimedia.org/r/912240 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:11:14] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: delete migrated eventgate alerts [puppet] - 10https://gerrit.wikimedia.org/r/908917 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [09:12:35] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: reword description to remove double-negative [alerts] - 10https://gerrit.wikimedia.org/r/908879 (owner: 10Cwhite) [09:12:43] (03PS2) 10Elukey: amd-gpu-tester: reduce image size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910743 (https://phabricator.wikimedia.org/T333009) [09:13:09] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: ensure several dashboards plugins are absent [puppet] - 10https://gerrit.wikimedia.org/r/908884 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [09:13:14] 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10jcrespo) a:05jcrespo→03wiki_willy @wiki_willy @Papaul Can we get a replacement DIMM? The urgency is that our guess is that warranty lasts until today. 
CC @KOfori [09:15:11] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [09:15:37] (03CR) 10Filippo Giunchedi: [C: 03+1] team-dcops: Add or clause for older node-exporter versions [alerts] - 10https://gerrit.wikimedia.org/r/911778 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [09:19:30] !log btullis@deploy1002 Started deploy [analytics/refinery@571f955]: Regular analytics weekly train [analytics/refinery@571f955] [09:20:16] !log btullis@deploy1002 Finished deploy [analytics/refinery@571f955]: Regular analytics weekly train [analytics/refinery@571f955] (duration: 00m 46s) [09:22:56] (03PS3) 10Elukey: amd-gpu-tester: reduce image size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910743 (https://phabricator.wikimedia.org/T333009) [09:24:51] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [09:25:51] !log btullis@deploy1002 Started deploy [analytics/refinery@571f955] (thin): Regular analytics weekly train THIN [analytics/refinery@571f955] [09:25:57] !log btullis@deploy1002 Finished deploy [analytics/refinery@571f955] (thin): Regular analytics weekly train THIN [analytics/refinery@571f955] (duration: 00m 05s) [09:26:09] !log btullis@deploy1002 Started deploy [analytics/refinery@571f955] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@571f955] [09:26:10] (03PS1) 10Slyngshede: C:idm::jobs Enable notification queue. 
[puppet] - 10https://gerrit.wikimedia.org/r/912242 (https://phabricator.wikimedia.org/T320808) [09:26:13] !log btullis@deploy1002 Finished deploy [analytics/refinery@571f955] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@571f955] (duration: 00m 04s) [09:27:41] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911867 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [09:31:09] !log restarting varnish on cp4044 and cp4052 to drop port 80 - T322774 [09:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:40] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable http->https in haproxy on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911868 (owner: 10Vgutierrez) [09:33:46] (03PS3) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911868 [09:34:34] (03PS4) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4044,cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/911868 (https://phabricator.wikimedia.org/T322774) [09:35:01] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [09:35:20] ^^ expected [09:35:28] (host is currently depooled) [09:37:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: deployment_server stanzas [puppet] - 10https://gerrit.wikimedia.org/r/911886 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [09:38:21] 10SRE-swift-storage, 10Thumbor, 10Platform Team Workboards (Platform Engineering Reliability), 10SVG: SVG rasterizer renders non Latin text as tofu glyph randomly - https://phabricator.wikimedia.org/T335271 (10hnowlan) [09:38:39] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.142 
second response time https://wikitech.wikimedia.org/wiki/Varnish [09:40:03] PROBLEM - Disk space on ms-be2071 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/objects0 0 MB (0% inode=31%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2071&var-datasource=codfw+prometheus/ops [09:42:52] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910743 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:45:45] (03CR) 10Volans: "did a quick pass, see comments inline" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [09:48:51] (03PS2) 10Alexandros Kosiaris: services_proxy: Add machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/911887 (https://phabricator.wikimedia.org/T331505) [09:49:03] (03CR) 10David Caro: [C: 03+1] "LGTM, I was thinking that there might be a more lightway way of getting that info, but did not find anything on the management api we coul" [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [09:49:46] !log btullis@cumin1001 Added views for new wiki: kbdwiktionary T333270 [09:49:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [09:49:52] T333270: Prepare and check storage layer for kbdwiktionary - https://phabricator.wikimedia.org/T333270 [09:50:34] (03CR) 10FNegri: k8s: Allow loading relative paths on kubeconfig certs (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [09:53:32] (03PS1) 10Alexandros Kosiaris: services_proxy: Comment port re-use [puppet] - 10https://gerrit.wikimedia.org/r/912244 [09:54:22] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [09:54:26] !log 
akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [09:55:48] (03PS1) 10Jbond: ceph_disk: sript everything for good measure [puppet] - 10https://gerrit.wikimedia.org/r/912245 [09:56:21] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10klausman) [09:56:44] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10klausman) [09:57:17] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [09:57:52] (03PS1) 10Alexandros Kosiaris: machinetranslation: Fix requests vs limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/912246 (https://phabricator.wikimedia.org/T331505) [09:58:01] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10klausman) [09:58:13] (03CR) 10Btullis: [C: 03+1] "Many thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/912245 (owner: 10Jbond) [09:58:36] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10klausman) [09:59:03] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10klausman) [09:59:40] (03CR) 10Jbond: [C: 03+2] ceph_disk: sript everything for good measure [puppet] - 10https://gerrit.wikimedia.org/r/912245 (owner: 10Jbond) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T1000) [10:05:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 1828 [10:07:20] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [10:08:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 1828 [10:13:26] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp4043,cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/912248 (https://phabricator.wikimedia.org/T322774) [10:13:35] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:16:28] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4043,cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/912249 (https://phabricator.wikimedia.org/T322774) [10:18:23] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add backup type failover to script output [puppet] - 10https://gerrit.wikimedia.org/r/912239 (https://phabricator.wikimedia.org/T330771) (owner: 10Jelto) [10:20:55] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp4043,cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/912248 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:25:04] !log restarting varnish on cp4043 and cp4051 to drop port 80 - T322774 [10:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:27:02] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp4043,cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/912249 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:27:08] (03PS2) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4043,cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/912249 (https://phabricator.wikimedia.org/T322774) [10:28:52] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4051 is CRITICAL: connect to address 10.128.0.37 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [10:29:02] (03CR) 10Arturo Borrero Gonzalez: k8s: Allow loading relative paths on kubeconfig certs (031 comment) 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [10:31:45] ^^ that's kinda stale.. haproxy is already listening on port 80 in cp4051 [10:31:56] I'm wondering why we are only getting that alert for the upload cluster [10:35:05] (03CR) 10Filippo Giunchedi: "Overall LGTM, PCC fails though: https://puppet-compiler.wmflabs.org/output/910889/40879/webperf1003.eqiad.wmnet/change.webperf1003.eqiad.w" [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [10:35:26] (03PS1) 10Jbond: puppet-common.sh: update to use standard yaml [puppet] - 10https://gerrit.wikimedia.org/r/912251 [10:39:17] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:40:22] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,ferm.service,prometheus-nic-firmware-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:13] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline re: new job" [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: 10Cwhite) [10:41:38] (03CR) 10Kosta Harlan: [C: 03+1] dewiki: Deploy Growth features to 100% of newcomers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) (owner: 10Urbanecm) [10:43:07] (03CR) 10Slyngshede: [C: 03+2] C:idm::jobs Enable notification queue. 
[puppet] - 10https://gerrit.wikimedia.org/r/912242 (https://phabricator.wikimedia.org/T320808) (owner: 10Slyngshede) [10:44:14] (03CR) 10Filippo Giunchedi: "LGTM, though PCC fails https://puppet-compiler.wmflabs.org/output/910856/40880/webperf1003.eqiad.wmnet/change.webperf1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [10:44:18] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:44:29] !log btullis@deploy1002 Started deploy [analytics/refinery@571f955]: Regular analytics weekly train [analytics/refinery@571f955] [10:45:10] (03CR) 10Arturo Borrero Gonzalez: k8s: Allow loading relative paths on kubeconfig certs (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [10:48:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,produce_canary_events.service,refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:53] !log btullis@deploy1002 Finished deploy [analytics/refinery@571f955]: Regular analytics weekly train [analytics/refinery@571f955] (duration: 05m 23s) [10:50:19] !log btullis@deploy1002 Started deploy [analytics/refinery@571f955] (thin): Regular analytics weekly train THIN [analytics/refinery@571f955] [10:52:27] !log btullis@deploy1002 Finished deploy [analytics/refinery@571f955] (thin): Regular analytics weekly train THIN [analytics/refinery@571f955] (duration: 02m 08s) [10:52:36] !log btullis@deploy1002 Started deploy [analytics/refinery@571f955] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@571f955] [10:54:05] !log btullis@deploy1002 Finished deploy [analytics/refinery@571f955] (hadoop-test): 
Regular analytics weekly train TEST [analytics/refinery@571f955] (duration: 01m 30s) [10:54:25] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/906066 (owner: 10Jbond) [10:54:44] (03CR) 10Effie Mouzeli: [C: 04-1] "PCC fails due to a change in parent change" [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [10:56:50] (03PS1) 10Slyngshede: C:idm load MediaWiki social auth module. [puppet] - 10https://gerrit.wikimedia.org/r/912256 [10:57:08] (03CR) 10FNegri: k8s: Allow loading relative paths on kubeconfig certs (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [10:57:34] (03CR) 10Slyngshede: [C: 03+2] C:idm load MediaWiki social auth module. [puppet] - 10https://gerrit.wikimedia.org/r/912256 (owner: 10Slyngshede) [10:58:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.014% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:01:08] PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:02:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/912251 (owner: 10Jbond) [11:02:55] (03PS1) 10Hnowlan: Install noto fonts by default [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) [11:04:32] 10SRE-swift-storage, 10Thumbor, 10Patch-For-Review, 10Platform Team Workboards (Platform Engineering Reliability), 10SVG: SVG rasterizer renders non Latin text as tofu glyph randomly - https://phabricator.wikimedia.org/T335271 (10hnowlan) The inconsistency here is between old Thumbor and thumbor-k8s - ol... [11:07:44] (03PS1) 10Slyngshede: C:idm Configure social pipeline for MediaWiki auth. [puppet] - 10https://gerrit.wikimedia.org/r/912263 [11:11:06] (03CR) 10Muehlenhoff: Install noto fonts by default (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [11:12:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:03] (03CR) 10Muehlenhoff: Install noto fonts by default (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [11:15:37] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:16:29] !log import php-excimer 1.0.2-1+wmf3+buster1+icu67 to component/icu67 T332964 [11:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] T332964: Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964 [11:24:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [11:25:41] !log akosiaris@deploy1002 helmfile [staging] DONE 
helmfile.d/services/machinetranslation: apply [11:26:14] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:26:30] (03PS1) 10Hnowlan: pin libmagickore-6 and libmagickwand-6 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) [11:27:18] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:27:25] 10SRE-swift-storage, 10Thumbor, 10Patch-For-Review, 10Platform Team Workboards (Platform Engineering Reliability), 10SVG: SVG rasterizer renders non Latin text as tofu glyph randomly (as thumbor-k8s lack noto fonts) - https://phabricator.wikimedia.org/T335271 (10Aklapper) [11:29:13] (03CR) 10David Caro: k8s: Allow loading relative paths on kubeconfig certs (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [11:29:35] (03PS2) 10David Caro: k8s: Allow loading relative paths on kubeconfig certs [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 [11:33:09] (03PS2) 10Hnowlan: Install noto fonts by default [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) [11:33:38] (03CR) 10Hnowlan: Install noto fonts by default (032 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [11:34:26] (03CR) 10Hnowlan: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [11:34:37] (03PS6) 10ArielGlenn: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) (owner: 10Hokwelum) [11:34:48] (03PS3) 10David Caro: k8s: Allow loading relative paths on kubeconfig certs [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/911880 [11:34:57] (03CR) 10David Caro: [V: 03+1] "Tested on tools" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [11:36:09] (03CR) 10Jbond: [C: 03+2] team-dcops: Add or clause for older node-exporter versions [alerts] - 10https://gerrit.wikimedia.org/r/911778 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [11:36:28] (03CR) 10Jbond: [C: 03+2] spicerack: install python3-aiohttp [puppet] - 10https://gerrit.wikimedia.org/r/906066 (owner: 10Jbond) [11:36:31] (03PS7) 10ArielGlenn: WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) (owner: 10Hokwelum) [11:36:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [11:37:21] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [11:37:49] (03CR) 10Muehlenhoff: pin libmagickore-6 and libmagickwand-6 (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [11:38:20] (03CR) 10JMeybohm: [C: 03+1] docker-report: Exclude more stretch base images [puppet] - 10https://gerrit.wikimedia.org/r/911784 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [11:38:22] (03Merged) 10jenkins-bot: team-dcops: Add or clause for older node-exporter versions [alerts] - 10https://gerrit.wikimedia.org/r/911778 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [11:45:44] (03CR) 10FNegri: [C: 03+1] k8s: Allow loading relative paths on kubeconfig certs (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 (owner: 10David Caro) [11:49:41] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T335403 (10phaultfinder) [11:50:57] (03CR) 10Muehlenhoff: [C: 
03+2] docker-report: Exclude more stretch base images [puppet] - 10https://gerrit.wikimedia.org/r/911784 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [11:52:34] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [11:57:59] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:18] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10JMeybohm) While refactoring kubernetes puppet code I came across the fact that we place credentials t... [12:03:00] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) [12:03:37] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) (duration: 00m 34s) [12:04:26] (03PS4) 10David Caro: k8s: Allow loading relative paths on kubeconfig certs [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/911880 [12:09:16] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:28] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) [12:10:43] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) (duration: 01m 15s) [12:10:52] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) [12:11:25] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@93a04bd] (releasing): 
(no justification provided) (duration: 00m 33s) [12:13:02] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) [12:13:38] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@93a04bd] (releasing): (no justification provided) (duration: 00m 36s) [12:27:02] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:29:50] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:00] 10SRE, 10API Platform, 10ChangeProp, 10EventStreams, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10VirginiaPoundstone) [12:31:15] 10SRE, 10API Platform, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10VirginiaPoundstone) [12:35:32] (03CR) 10David Caro: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [12:35:53] (03CR) 10David Caro: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [12:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:36:40] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:31] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host cp5013.mgmt.eqsin.wmnet with reboot policy FORCED [12:41:49] (03CR) 10Jbond: [C: 03+2] puppet-common.sh: update to use standard yaml [puppet] - 10https://gerrit.wikimedia.org/r/912251 (owner: 10Jbond) [12:43:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/912230 (owner: 10Muehlenhoff) [12:43:51] (03CR) 10Jbond: [C: 03+1] sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [12:46:44] (03CR) 10Muehlenhoff: java/openjdk-11 - base on debian bullsye (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911905 (owner: 10Ottomata) [12:49:01] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:49:04] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [12:49:34] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:49:46] (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Remove some outdated special case handling for adding sec repo [puppet] - 10https://gerrit.wikimedia.org/r/912230 (owner: 10Muehlenhoff) [12:51:29] (03CR) 10Ottomata: java/openjdk-11 - base on debian bullsye (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911905 (owner: 10Ottomata) [12:52:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:52:24] 10SRE, 
10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [12:54:05] (03PS2) 10Ottomata: java/openjdk-11 - base on debian bullsye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911905 [12:55:52] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp5013.mgmt.eqsin.wmnet with reboot policy FORCED [12:56:09] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:57:39] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:00:06] Deploy window UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T1300) [13:00:06] No Gerrit patches in the queue for this window AFAICS. 
[13:00:27] (03PS1) 10Alexandros Kosiaris: mesh: Fix a mess with trimming ending whitespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/912283 [13:00:29] (03CR) 10Kamila Součková: [C: 03+1] "LGTM other than Moritz's comment" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [13:02:46] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp4042,cp4050 [puppet] - 10https://gerrit.wikimedia.org/r/912285 (https://phabricator.wikimedia.org/T322774) [13:02:52] (03PS3) 10Ottomata: java/openjdk-11 - base on debian bullsye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911905 [13:03:08] (03CR) 10Ottomata: java/openjdk-11 - base on debian bullsye (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911905 (owner: 10Ottomata) [13:03:53] (03Abandoned) 10JMeybohm: Test yaml anchors in hiera [puppet] - 10https://gerrit.wikimedia.org/r/912238 (owner: 10JMeybohm) [13:03:57] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [13:04:18] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4042,cp4050 [puppet] - 10https://gerrit.wikimedia.org/r/912286 (https://phabricator.wikimedia.org/T322774) [13:04:38] (03CR) 10JMeybohm: [V: 03+1] "This change is ready for review." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:05:49] (03PS4) 10Andrew Bogott: rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) [13:06:12] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [13:06:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (10CDanis) Checking in about this again, as it'd be useful for intern project work. Even just being able to use it on part of the f... [13:06:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [13:06:43] (03Merged) 10jenkins-bot: sre.discovery.datacenter: exclude services not in production [cookbooks] - 10https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: 10Clément Goubert) [13:09:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [13:09:51] (03CR) 10Andrew Bogott: rabbitmq: add a single-purpose metric to detect network partition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [13:09:58] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: add a single-purpose metric to detect network partition [puppet] - 10https://gerrit.wikimedia.org/r/911927 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [13:10:31] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp4042,cp4050 [puppet] - 10https://gerrit.wikimedia.org/r/912285 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [13:13:00] (03CR) 10Kamila Součková: [C: 03+1] "recheck" 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/909212 (owner: 10Hnowlan) [13:13:25] !log restarting varnish on cp4042 and cp4050 to drop port 80 - T322774 [13:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:55] (03PS2) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4042,cp4050 [puppet] - 10https://gerrit.wikimedia.org/r/912286 (https://phabricator.wikimedia.org/T322774) [13:13:59] !log Locking scap for datacenter switchback - T327920 [13:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:04] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [13:14:50] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp4042,cp4050 [puppet] - 10https://gerrit.wikimedia.org/r/912286 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [13:15:43] !log cgoubert@deploy1002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchback - T327920 [13:15:52] PROBLEM - Varnish HTTP text-frontend - port 80 on cp4042 is CRITICAL: connect to address 10.128.0.29 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:17:07] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:17:23] stale alert.. 
haproxy is already handling port 80 there [13:18:12] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:02] RECOVERY - Varnish HTTP text-frontend - port 80 on cp4042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:22:18] 10SRE, 10Traffic, 10API Platform (RESTbase Deprecation Roadmap): Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10VirginiaPoundstone) [13:22:38] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10VirginiaPoundstone) [13:23:27] !log Starting mediawiki datacenter switchback preparation - T327920 [13:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:34] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [13:24:09] (03CR) 10Noa wmde: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/912290 (https://phabricator.wikimedia.org/T308062) (owner: 10Noa wmde) [13:24:15] (03CR) 10JMeybohm: [C: 03+1] role::builder: add ml-runner user [puppet] - 10https://gerrit.wikimedia.org/r/912240 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:24:38] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10VirginiaPoundstone) [13:25:20] 10SRE, 10Traffic, 10API Platform (API Platform Roadmap), 10Discovery-Search (Current work): Generic strategy to deal with high volume / expensive traffic from cloud providers - https://phabricator.wikimedia.org/T326782 (10VirginiaPoundstone) [13:25:20] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [13:25:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [13:25:27] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10VirginiaPoundstone) [13:25:33] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [13:26:18] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::builder: add ml-runner user [puppet] - 10https://gerrit.wikimedia.org/r/912240 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:27:15] (03PS1) 10Andrew Bogott: detect_rabbit_partition: make executable [puppet] - 10https://gerrit.wikimedia.org/r/912291 (https://phabricator.wikimedia.org/T335304) [13:27:22] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: reduce image size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/910743 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:27:44] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:29:40] (03CR) 10Andrew Bogott: [C: 03+2] detect_rabbit_partition: make executable [puppet] - 10https://gerrit.wikimedia.org/r/912291 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [13:29:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Fix requests vs limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/912246 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [13:30:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:31:11] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [13:31:19] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches [13:33:43] RECOVERY - Disk space on ms-be2071 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2071&var-datasource=codfw+prometheus/ops [13:35:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=0) [13:35:31] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [13:35:39] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [13:35:40] (03Merged) 10jenkins-bot: machinetranslation: Fix requests vs limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/912246 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [13:35:45] GO/NOGO for stopping maintenance scripts ? 
[13:35:46] ACKNOWLEDGEMENT - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service MVernon known issue with container consistency https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:00] If I don't get a NO GO, I'm stopping them at 13:45 [13:39:02] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:39:10] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp4041,cp4049 [puppet] - 10https://gerrit.wikimedia.org/r/912296 (https://phabricator.wikimedia.org/T322774) [13:39:20] (03CR) 10BBlack: [C: 03+1] "LGTM, thanks for picking this up! I'm pretty sure puppet won't restart pybal for us, so we can test gradually via manual restarts in ulsf" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [13:39:36] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp4041,cp4049 [puppet] - 10https://gerrit.wikimedia.org/r/912296 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [13:39:42] (03PS1) 10KartikMistry: machinetranslation: Fix gunicorn workers setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/912298 [13:40:31] (03PS2) 10Hnowlan: pin libmagickore-6 and libmagickwand-6 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) [13:42:13] (03CR) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [13:43:33] !log restarting varnish on cp4041 and cp4049 to drop port 80 - T322774 [13:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:39] akosiaris, bblack, marostegui, Amir1, last call for NO GO before 
stopping maintenance scripts [13:43:59] claime: all fine from my side [13:44:08] 👀 [13:44:20] (03CR) 10Herron: [C: 03+1] prometheus: Add support for syncing data between Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [13:44:54] (03CR) 10Majavah: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [13:45:16] !log Stopping maintenance scripts for datacenter switchback - T327920 [13:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:22] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4041,cp4049 [puppet] - 10https://gerrit.wikimedia.org/r/912299 (https://phabricator.wikimedia.org/T322774) [13:45:23] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [13:45:29] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:45:34] (03CR) 10Hnowlan: [C: 03+2] Install noto fonts by default [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [13:45:44] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) [13:45:47] (03CR) 10Kamila Součková: pin libmagickore-6 and libmagickwand-6 (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [13:45:49] (03CR) 10Herron: [C: 03+1] nit: fix missing space in desc [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911939 (owner: 10Ryan Kemper) [13:46:25] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp4041,cp4049 [puppet] - 10https://gerrit.wikimedia.org/r/912299 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [13:46:46] !log 
cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:47:01] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) [13:47:17] Checking issue with maintenance script stop [13:47:22] (03CR) 10Herron: "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/911872 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [13:48:23] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4049 is CRITICAL: connect to address 10.128.0.24 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:48:41] what does it say? [13:50:06] (03CR) 10Majavah: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [13:50:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 136106 [13:50:37] The systemctl list units command fails [13:51:09] (03PS3) 10Hnowlan: pin libmagickore-6 and libmagickwand-6 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) [13:51:20] (03PS1) 10Elukey: amd-gpu-tester: add libelf-dev to the package list [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/912300 (https://phabricator.wikimedia.org/T333009) [13:51:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 136106 [13:51:44] note that the peering cookbook are a no brainer, but let me know if I should postpone them to limit the noise in here [13:51:47] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:51:55] (03CR) 10Hnowlan: pin libmagickore-6 and libmagickwand-6 (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912267 
(https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [13:52:17] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: add libelf-dev to the package list [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/912300 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:52:30] claime: if you're not already there, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/mediawiki.py#237 - status 255 means there's some unit still enabled, looking to see what it is [13:53:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 1239 [13:53:31] I see no mediawiki_jobcgoubert@deploy2002:~$ sudo systemctl list-units 'mediawiki_job_*' --all [13:53:33] 0 loaded units listed. [13:53:39] Same on deploy1002 [13:53:47] mwmaint2002 [13:53:57] Ah yeah ofc [13:54:08] there are five growthexperiments-refreshLinkRecommendations jobs and a startupregistrystats unit, all in status failed [13:54:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 1239 [13:54:28] reset failed, rerun? 
[13:54:30] so "failed" explains why list-units is still showing them, but it doesn't mean they're about to start rerunning [13:54:35] (03Merged) 10jenkins-bot: Install noto fonts by default [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/912258 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [13:54:38] yeah agreed [13:55:10] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:55:17] we should probably figure out what went on with those timers, but that can wait, I'll open a task [13:55:22] ack [13:55:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [13:55:48] Looking good [13:55:54] 👍 [13:55:56] 5 minute warning for read-only [13:56:00] 👍 [13:56:04] You have time for GO/NOGO [13:56:54] (and obviously I got rung by a delivery driver and had to shout to my SO "I'm in the middle of an intervention, can you go" lol) [13:57:12] Outside world intrusion [13:57:19] !sing [13:57:19] Never gonna give you up [13:57:20] Never gonna let you down [13:57:21] Never gonna run around and desert you [13:57:22] Never gonna make you cry [13:57:22] Never gonna say goodbye [13:57:26] Never gonna tell a lie and hurt you [13:57:34] Thanks sirenbot [13:57:38] Truly a voice for the ages [13:57:45] oped and functional apparently [13:57:49] why does that feature exist [13:57:54] there is a reason my oncall ends at 16:00 [13:58:02] (in three minutes) [13:58:03] taavi: why do all easter eggs exist ? [13:58:03] critical self-test functionality [13:58:15] It's a siren, it should sing [13:58:16] (03PS1) 10Muehlenhoff: Add safe.directory config to make Refinery deploys compatible with CVE-2022-24765 fix [puppet] - 10https://gerrit.wikimedia.org/r/912301 [13:58:25] but yeah, I used it to make sure it's fully functional [13:59:10] Last call before RO [13:59:25] 🍀 [13:59:28] get it! 
[13:59:49] !log Going to read-only for mediawiki datacenter switchback - T327920 [13:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:55] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:00:02] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:00:02] !log cgoubert@cumin1001 MediaWiki read-only period starts at: 2023-04-26 14:00:01.264329 [14:00:06] claime: gettimeofday() says it's time for Datacenter Switchback - MediaWiki. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T1400) [14:00:11] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:00:14] XD [14:00:19] I always like that one [14:00:25] (03CR) 10CI reject: [V: 04-1] Add safe.directory config to make Refinery deploys compatible with CVE-2022-24765 fix [puppet] - 10https://gerrit.wikimedia.org/r/912301 (owner: 10Muehlenhoff) [14:00:25] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:00:25] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:00:48] very useful bot [14:01:15] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:01:17] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:31] LTE fail [14:01:41] oh no [14:01:45] LTE? [14:01:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:01:55] claime's connection [14:01:56] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:58] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:01:59] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:02:03] I'm back [14:02:06] Gonna ruin my stats ffs [14:02:07] oh [14:02:10] ok [14:02:42] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:02:42] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:42] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:02:42] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:44] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:02:46] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:46] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:02:47] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:56] edits are back [14:02:56] I hear sounds [14:03:00] We're out [14:03:01] eswiki looks good [14:03:01] weewooo [14:03:02] !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-04-26 14:03:01.527715 [14:03:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:03:05] I can edit eswiki [14:03:16] 3 minutes with a connection frop [14:03:16] drop* [14:03:18] claime: you were about fifteen seconds away from me assuming you were dead and continuing on your behalf :P [14:03:28] Well 2 connection drops [14:03:31] yeah, that [14:03:35] * claime breathes [14:03:44] commons looks good too [14:03:57] fr read ok [14:04:00] claime has done a great job for the switchover again, we should make them do it every day [14:04:02] spike of appserver 5xxs but subsiding [14:04:07] 3 mins 1 seconds [14:04:16] Going to restart envoys [14:04:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:04:24] given the connection thing, this is pretty good [14:04:25] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [14:04:27] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [14:05:00] !log Restarting maintenance jobs - T327920 [14:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:04] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:05:04] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:06:00] claime: want me to close out the statuspage incident? [14:06:04] (CR) Muehlenhoff: [C: +1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/911905 (owner: Ottomata) [14:06:12] rzl: go ahead [14:06:16] ack doing [14:07:16] done [14:07:17] (PS2) Muehlenhoff: Make Refinery deploys compatible with CVE-2022-24765 fix [puppet] - https://gerrit.wikimedia.org/r/912301 [14:07:36] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:07:42] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [14:08:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [14:08:36] (CR) Clément Goubert: [C: +2] db: Switch dns master alias to eqiad [dns] - https://gerrit.wikimedia.org/r/912235 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert) [14:08:46] 🍿 [14:08:59] !log Phase 9.5 Update DNS records for new database masters [14:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:10] claime: I am going to update the ones for parsercache [14:09:14] that 500 spike is resolved, appservers are still looking good [14:09:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of 
MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:09:22] (CR) Ayounsi: [C: +1] prometheus::ops: add demo node exporter job for SONiC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: Cwhite) [14:09:33] marostegui: Running authdns update now, all yours once it's done [14:10:08] marostegui: go [14:10:12] claime: will do [14:10:24] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters [14:11:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/912301 (owner: Muehlenhoff) [14:12:16] Banner looks gone [14:12:16] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Papaul) Create Dispatch: Success You have successfully submitted request SR166997440. 
[14:13:27] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [14:14:39] (PS1) Marostegui: wmnet: Update parsercache CNAME [dns] - https://gerrit.wikimedia.org/r/912302 (https://phabricator.wikimedia.org/T327920) [14:14:45] (Abandoned) Marostegui: wmnet: Update parsercache CNAME [dns] - https://gerrit.wikimedia.org/r/912171 (https://phabricator.wikimedia.org/T327920) (owner: Marostegui) [14:14:52] (CR) Kamila Součková: [C: +1] "LGTM" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) (owner: Hnowlan) [14:14:54] SRE, MediaWiki-Authentication-and-authorization, MediaWiki-User-login-and-signup, MediaWiki-extensions-CentralAuth, and 2 others: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out st... - https://phabricator.wikimedia.org/T335125 [14:15:07] SRE, serviceops: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (RLazarus) [14:15:25] SRE, serviceops: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (RLazarus) [14:15:33] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (RLazarus) [14:15:41] (CR) Marostegui: [C: +2] wmnet: Update parsercache CNAME [dns] - https://gerrit.wikimedia.org/r/912302 (https://phabricator.wikimedia.org/T327920) (owner: Marostegui) [14:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:16:00] !log Update dns for parsercache T327920 [14:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:05] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:16:30] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Marostegui) [14:19:39] (CR) Hnowlan: [C: +2] pin libmagickore-6 and libmagickwand-6 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) (owner: Hnowlan) [14:21:01] Puppet taking its sweet sweet time running on db masters [14:21:42] claime: do you want me to reboot them? it'll make it go faster [14:21:56] lolno [14:21:57] SRE, Infrastructure-Foundations, serviceops, Datacenter-Switchover: sre.discovery.datacenter breaks on services not in "production" state - https://phabricator.wikimedia.org/T335341 (Clement_Goubert) Open→Resolved [14:21:58] is ok [14:22:01] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert) [14:23:40] (CR) Kamila Součková: svg: use rsvg-convert output flag (2 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: Hnowlan) [14:23:48] (CR) Alexandros Kosiaris: [C: +1] sre.discovery.datacenter: exclude services not in production [cookbooks] - https://gerrit.wikimedia.org/r/911894 (https://phabricator.wikimedia.org/T335341) (owner: Clément Goubert) [14:24:41] I'm gonna go ahead and unlock scap while it's running puppet because it IS taking a while [14:24:46] !log cgoubert@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchback - T327920 (duration: 69m 03s) 
[14:24:52] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:25:15] (CR) TrainBranchBot: [C: +2] "Approved by cgoubert@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/911792 (owner: Clément Goubert) [14:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:26:05] (Merged) jenkins-bot: Revert "debug.json: List primary DC servers first" [mediawiki-config] - https://gerrit.wikimedia.org/r/911792 (owner: Clément Goubert) [14:26:30] !log cgoubert@deploy1002 Started scap: Backport for [[gerrit:911792|Revert "debug.json: List primary DC servers first"]] [14:27:35] (Merged) jenkins-bot: pin libmagickore-6 and libmagickwand-6 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/912267 (https://phabricator.wikimedia.org/T334863) (owner: Hnowlan) [14:28:05] (CR) David Caro: [C: -1] profile:toolforge:harbor: setup blackbox monitoring (2 comments) [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: Raymond Ndibe) [14:28:07] !log cgoubert@deploy1002 cgoubert: Backport for [[gerrit:911792|Revert "debug.json: List primary DC servers first"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:28:47] claime: stepping away for a little bit if you don't need me, nice job! 
[14:29:03] rzl: I'm good thanks, ttyl [14:29:07] (CR) Andrea Denisse: [C: +2] prometheus: Add support for syncing data between Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/909738 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [14:29:08] <3 [14:30:31] (CR) Filippo Giunchedi: prometheus::ops: add demo node exporter job for SONiC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: Cwhite) [14:30:54] (CR) Hnowlan: [C: +2] Minor formatting changes [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/909212 (owner: Hnowlan) [14:31:25] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) [14:32:09] (CR) Cwhite: prometheus::ops: add demo node exporter job for SONiC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: Cwhite) [14:33:01] (PS1) Clément Goubert: Revert "wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002" [dns] - https://gerrit.wikimedia.org/r/911794 [14:33:08] (CR) CI reject: [V: -1] Revert "wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002" [dns] - https://gerrit.wikimedia.org/r/911794 (owner: Clément Goubert) [14:33:10] (CR) Kamila Součková: svg: use rsvg-convert output flag (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: Hnowlan) [14:33:33] (PS1) Vgutierrez: haproxy: Disable http->https in varnish on cp4040,cp4048 [puppet] - https://gerrit.wikimedia.org/r/912304 (https://phabricator.wikimedia.org/T322774) [14:34:02] (CR) Majavah: "My only concern is that we don't have a newer replacement tcl image. Otherwise LGTM." 
[docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/911953 (owner: BryanDavis) [14:34:08] (CR) Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS (2 comments) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE)) [14:34:34] (PS1) RobH: install to secure erase cp501[3456] [puppet] - https://gerrit.wikimedia.org/r/912305 (https://phabricator.wikimedia.org/T330313) [14:34:37] !log cgoubert@deploy1002 Finished scap: Backport for [[gerrit:911792|Revert "debug.json: List primary DC servers first"]] (duration: 08m 07s) [14:34:46] (CR) Vgutierrez: [C: +2] haproxy: Disable http->https in varnish on cp4040,cp4048 [puppet] - https://gerrit.wikimedia.org/r/912304 (https://phabricator.wikimedia.org/T322774) (owner: Vgutierrez) [14:35:35] (CR) RobH: [C: +2] install to secure erase cp501[3456] [puppet] - https://gerrit.wikimedia.org/r/912305 (https://phabricator.wikimedia.org/T330313) (owner: RobH) [14:36:19] (PS2) Clément Goubert: Revert "wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002" [dns] - https://gerrit.wikimedia.org/r/911794 [14:38:08] (PS1) Vgutierrez: hiera: Enable http->https in haproxy on cp4040,cp4048 [puppet] - https://gerrit.wikimedia.org/r/912306 (https://phabricator.wikimedia.org/T322774) [14:38:12] (CR) Clément Goubert: [C: +2] Revert "wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002" [dns] - https://gerrit.wikimedia.org/r/911794 (owner: Clément Goubert) [14:38:14] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) Open→Resolved [14:38:21] SRE, serviceops, Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (Aklapper) [14:39:12] (CR) Vgutierrez: [C: +2] hiera: 
Enable http->https in haproxy on cp4040,cp4048 [puppet] - https://gerrit.wikimedia.org/r/912306 (https://phabricator.wikimedia.org/T322774) (owner: Vgutierrez) [14:40:38] (PS1) Cathal Mooney: Change Homer template to get license key from custom field [homer/public] - https://gerrit.wikimedia.org/r/912307 (https://phabricator.wikimedia.org/T334180) [14:41:03] !log restarting varnish on cp4040 and cp4048 to drop port 80 - T322774 [14:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:20] (Merged) jenkins-bot: Minor formatting changes [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/909212 (owner: Hnowlan) [14:44:27] ops-eqsin, DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (RobH) [14:45:10] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp5013.eqsin.wmnet with OS bullseye [14:45:19] ops-eqsin, DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp5013.eqsin.wmnet with OS bullseye [14:45:29] SRE, serviceops, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Clement_Goubert) Everything went great, thanks for your support! 
[14:45:34] (PS1) Thiemo Kreuz (WMDE): Hide wrong "this reference is used 0 times" in citation dialog [extensions/Cite] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/911796 (https://phabricator.wikimedia.org/T241885) [14:45:54] (PS1) Thiemo Kreuz (WMDE): Hide wrong "this reference is used 0 times" in citation dialog [extensions/Cite] (wmf/1.41.0-wmf.5) - https://gerrit.wikimedia.org/r/911797 (https://phabricator.wikimedia.org/T241885) [14:46:24] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [14:46:44] SRE, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (Clement_Goubert) In progress→Resolved [14:47:15] SRE, serviceops, Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (Clement_Goubert) [14:47:18] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert) [14:47:24] (CR) Michael Große: [C: -1] "This is ready for review, but deployment should be coordinated with product" [mediawiki-config] - https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: Michael Große) [14:48:06] (CR) Volans: "replies inline" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE)) [14:48:27] (PS1) Urbanecm: [Growth] Remove config variables provided by extension [mediawiki-config] - https://gerrit.wikimedia.org/r/912310 [14:48:58] (CR) Urbanecm: [C: -2] dewiki: Deploy Growth features to 100% of newcomers (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) (owner: Urbanecm) 
[14:50:44] (CR) Majavah: profile:toolforge:harbor: setup blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: Raymond Ndibe) [14:50:47] (CR) Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE)) [14:52:09] SRE, Thumbor, Wikimedia-SVG-rendering, User-notice-archive: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (JoKalliauer) This update might trigged a difference in rendering between [[https://svgcheck.toolforge.org/index.php|SVG-check]] v... [14:54:21] (Abandoned) Jforrester: Revert "build: Remove pinning of indirect lcobucci/jwt dependency" [extensions/OAuth] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160) (owner: Jforrester) [14:55:53] (CR) Lucas Werkmeister (WMDE): [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: Michael Große) [14:56:57] (PS1) Muehlenhoff: Revert "sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage" [cookbooks] - https://gerrit.wikimedia.org/r/912311 (https://phabricator.wikimedia.org/T330495) [14:57:33] (CR) Volans: wmf-update-known-hosts-production: Automatically download DNS (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE)) [14:58:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.922% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:58:14] (PS1) BBlack: Varnish/ATS semicolon workaround for Restbase 
[puppet] - https://gerrit.wikimedia.org/r/912312 (https://phabricator.wikimedia.org/T238285) [14:59:46] (PS3) Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 [14:59:51] (CR) Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS (2 comments) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE)) [15:00:31] (PS1) Elukey: amd-gpu-tester: add librdm [docker-images/production-images] - https://gerrit.wikimedia.org/r/912313 (https://phabricator.wikimedia.org/T333009) [15:01:16] (CR) Elukey: [V: +2 C: +2] amd-gpu-tester: add librdm [docker-images/production-images] - https://gerrit.wikimedia.org/r/912313 (https://phabricator.wikimedia.org/T333009) (owner: Elukey) [15:02:10] (PS1) Muehlenhoff: sretest: Don't include opentelemetry-collector on bookworm [puppet] - https://gerrit.wikimedia.org/r/912314 [15:03:03] (PS1) Effie Mouzeli: hieradata: add secrets for excimer perf site [labs/private] - https://gerrit.wikimedia.org/r/912315 (https://phabricator.wikimedia.org/T291015) [15:03:11] (PS1) Ottomata: hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools [puppet] - https://gerrit.wikimedia.org/r/912316 (https://phabricator.wikimedia.org/T317167) [15:03:20] (CR) Ottomata: [C: +2] java/openjdk-11 - base on debian bullsye [docker-images/production-images] - https://gerrit.wikimedia.org/r/911905 (owner: Ottomata) [15:04:33] (PS2) Ottomata: hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools [puppet] - https://gerrit.wikimedia.org/r/912316 (https://phabricator.wikimedia.org/T317167) [15:04:35] (PS10) Effie Mouzeli: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 
Krinkle) [15:05:18] (CR) Effie Mouzeli: [C: +1] hieradata: add secrets for excimer perf site [labs/private] - https://gerrit.wikimedia.org/r/912315 (https://phabricator.wikimedia.org/T291015) (owner: Effie Mouzeli) [15:05:30] (CR) Muehlenhoff: [C: +2] sretest: Don't include opentelemetry-collector on bookworm [puppet] - https://gerrit.wikimedia.org/r/912314 (owner: Muehlenhoff) [15:05:32] (CR) Effie Mouzeli: [V: +2 C: +2] hieradata: add secrets for excimer perf site [labs/private] - https://gerrit.wikimedia.org/r/912315 (https://phabricator.wikimedia.org/T291015) (owner: Effie Mouzeli) [15:07:18] (CR) CI reject: [V: -1] hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools [puppet] - https://gerrit.wikimedia.org/r/912316 (https://phabricator.wikimedia.org/T317167) (owner: Ottomata) [15:13:42] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp5014.eqsin.wmnet with OS bullseye [15:13:47] SRE, ops-eqsin, DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp5014.eqsin.wmnet with OS bullseye [15:14:15] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp5015.eqsin.wmnet with OS bullseye [15:14:21] SRE, ops-eqsin, DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp5015.eqsin.wmnet with OS bullseye [15:14:46] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp5016.eqsin.wmnet with OS bullseye [15:14:51] SRE, ops-eqsin, DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp5016.eqsin.wmnet with OS bullseye [15:18:10] RECOVERY - 
Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: host reimage [15:18:54] (CR) David Caro: [C: -1] profile:toolforge:harbor: setup blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: Raymond Ndibe) [15:19:35] (CR) BryanDavis: Remove jessie and stretch image configuration (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/911953 (owner: BryanDavis) [15:19:56] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@5061681]: (no justification provided) [15:20:17] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@5061681]: (no justification provided) (duration: 00m 20s) [15:21:57] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: host reimage [15:22:08] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:29] (PS1) Elukey: amd-gpu-tester: add libdrm-amdgpu1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/912319 [15:28:56] (CR) Elukey: [V: +2 C: +2] amd-gpu-tester: add libdrm-amdgpu1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/912319 (owner: Elukey) [15:30:31] (PS2) Effie Mouzeli: coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle) [15:31:17] (PS3) Ottomata: hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools [puppet] - https://gerrit.wikimedia.org/r/912316 
(https://phabricator.wikimedia.org/T317167) [15:31:37] (CR) Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/910856/40887/" [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle) [15:31:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [15:32:05] SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [15:32:08] SRE, SRE-tools, Infrastructure-Foundations, User-MoritzMuehlenhoff: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (MoritzMuehlenhoff) >>! In T299989#8807364, @CDanis wrote: > Checking in about this again, as it'd be useful for intern project wo... [15:34:07] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [15:34:15] SRE, Infrastructure-Foundations, Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest... 
[15:34:24] (CR) Ottomata: [C: +2] hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools [puppet] - https://gerrit.wikimedia.org/r/912316 (https://phabricator.wikimedia.org/T317167) (owner: Ottomata) [15:35:02] (CR) Ottomata: [V: +2 C: +2] java/openjdk-11 - base on debian bullsye [docker-images/production-images] - https://gerrit.wikimedia.org/r/911905 (owner: Ottomata) [15:36:18] (PS3) Muehlenhoff: Make Refinery deploys compatible with CVE-2022-24765 fix [puppet] - https://gerrit.wikimedia.org/r/912301 [15:38:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage [15:40:52] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:41:14] (CR) Hnowlan: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/912312 (https://phabricator.wikimedia.org/T238285) (owner: BBlack) [15:41:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage [15:43:16] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [15:43:19] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/912301 (owner: Muehlenhoff) [15:43:43] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: host reimage [15:43:49] (PS2) Hnowlan: svg: use rsvg-convert output flag [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) [15:44:15] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: host reimage [15:44:16] !log robh@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: host reimage [15:45:09] (PS1) Elukey: amd-gpu-tester: add libnuma1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/912325 [15:45:37] (CR) Elukey: [V: +2 C: +2] amd-gpu-tester: add libnuma1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/912325 (owner: Elukey) [15:46:04] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:56] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: host reimage [15:47:28] (CR) BryanDavis: Remove jessie and stretch image configuration (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/911953 (owner: BryanDavis) [15:47:51] (Abandoned) Dzahn: gerrit: make configurable whether service is running [puppet] - https://gerrit.wikimedia.org/r/911941 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn) [15:49:20] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: host reimage [15:51:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5016.eqsin.wmnet with reason: host reimage [15:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [15:52:33] 7~/27 [15:52:37] err :) [15:53:07] (PS11) Effie Mouzeli: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle) [15:53:09] (PS3) Effie Mouzeli: coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 
(https://phabricator.wikimedia.org/T335242) (owner: Krinkle) [15:53:37] (CR) Michael Große: "Note that there is an almost identical change for puppet. Not sure which is the right place, or if we need both." [deployment-charts] - https://gerrit.wikimedia.org/r/912326 (https://phabricator.wikimedia.org/T225778) (owner: Michael Große) [15:53:57] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (colewhite) [15:54:01] (CR) JMeybohm: [C: +1] "I would say we're good to go" [deployment-charts] - https://gerrit.wikimedia.org/r/904226 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [15:54:16] (PS1) Cwhite: grafana: raise metadata fetch error [puppet] - https://gerrit.wikimedia.org/r/911842 (https://phabricator.wikimedia.org/T335413) [15:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:37] PROBLEM - Host mc2046 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:48] (CR) Michael Große: "Note that there is an almost identical change for deployment-charts. Not sure which is the right place, or if we need both" [puppet] - https://gerrit.wikimedia.org/r/912327 (https://phabricator.wikimedia.org/T225778) (owner: Michael Große) [15:56:45] RECOVERY - Host mc2046 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [15:56:53] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (colewhite) Updated wmf group ldap filters to use `uid=cec` because `uid=ccoxwell` does not exist. 
[15:56:56] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (jcrespo) Thank you Papaul for the quick reaction- I will leave the host up and running for now. [15:57:37] (PS4) Effie Mouzeli: coal: Uninstall from webperf role and start decom [puppet] - https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle) [15:57:39] (CR) JMeybohm: [C: +1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond) [15:58:19] (CR) JMeybohm: [C: +1] Remove remaining obsolete nodejs images only used on Stretch [docker-images/production-images] - https://gerrit.wikimedia.org/r/911761 (https://phabricator.wikimedia.org/T335282) (owner: Muehlenhoff) [15:58:24] (PS3) Hnowlan: svg: use rsvg-convert output flag [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) [15:58:39] (CR) Arturo Borrero Gonzalez: [C: +2] k8s: Allow loading relative paths on kubeconfig certs [software/tools-webservice] - https://gerrit.wikimedia.org/r/911880 (owner: David Caro) [15:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:53] (CR) Hnowlan: svg: use rsvg-convert output flag (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: Hnowlan) [16:00:49] (PS1) Joal: Fix profile::analytics::cluster::client [puppet] - https://gerrit.wikimedia.org/r/912330 [16:01:52] (PS3) Dzahn: mariadb::generic_server: change default datadir path [puppet] - 
10https://gerrit.wikimedia.org/r/909788 (https://phabricator.wikimedia.org/T329571) [16:02:07] ottomata, moritzm --^ see patch above [16:03:12] joal: already did, see chat in #wikimedia-analytics! [16:03:14] https://gerrit.wikimedia.org/r/c/operations/puppet/+/912316 [16:03:28] (03CR) 10SBassett: [C: 03+1] "(from a security perspective)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: 10Gergő Tisza) [16:04:06] Oh no I missed that ottomata - sorry for the ping moritzm and ottomata - And thank you ottomata for the quick fix [16:05:05] (03Abandoned) 10Joal: Fix profile::analytics::cluster::client [puppet] - 10https://gerrit.wikimedia.org/r/912330 (owner: 10Joal) [16:05:49] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:07:55] (03CR) 10BBlack: [C: 03+2] Varnish/ATS semicolon workaround for Restbase [puppet] - 10https://gerrit.wikimedia.org/r/912312 (https://phabricator.wikimedia.org/T238285) (owner: 10BBlack) [16:08:11] (03PS1) 10Andrew Bogott: detect_rabbit_partition: fix metric name and tag [puppet] - 10https://gerrit.wikimedia.org/r/912331 (https://phabricator.wikimedia.org/T335304) [16:08:55] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:10:49] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:12:58] (03CR) 10Dzahn: "the only production user is role parsoid/testreduce and that sets an explicit value for this. 
all other users are the cloud VPS projects w" [puppet] - 10https://gerrit.wikimedia.org/r/909788 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [16:13:44] (03PS1) 10Vgutierrez: hiera: disable http->https in varnish on cp4039,cp4047 [puppet] - 10https://gerrit.wikimedia.org/r/912332 (https://phabricator.wikimedia.org/T322774) [16:16:29] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] nit: fix missing space in desc [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911939 (owner: 10Ryan Kemper) [16:16:44] (03CR) 10Vgutierrez: [C: 03+2] hiera: disable http->https in varnish on cp4039,cp4047 [puppet] - 10https://gerrit.wikimedia.org/r/912332 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [16:17:05] !log restarting varnish on cp4039 and cp4047 to drop port 80 - T322774 [16:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:39] (03CR) 10David Caro: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/912331 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:17:41] (03CR) 10Andrew Bogott: [C: 03+2] detect_rabbit_partition: fix metric name and tag [puppet] - 10https://gerrit.wikimedia.org/r/912331 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:18:31] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10colewhite) [16:20:06] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4039,cp4047 [puppet] - 10https://gerrit.wikimedia.org/r/912333 (https://phabricator.wikimedia.org/T322774) [16:21:58] (03CR) 10Dzahn: "per https://phabricator.wikimedia.org/T335150#8808138 this uid does not exist" [puppet] - 10https://gerrit.wikimedia.org/r/911277 (https://phabricator.wikimedia.org/T335150) (owner: 10Marostegui) [16:22:05] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp4039,cp4047 [puppet] - 10https://gerrit.wikimedia.org/r/912333 (https://phabricator.wikimedia.org/T322774) 
(owner: 10Vgutierrez) [16:22:15] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:13] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [16:25:01] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10Dzahn) 05Resolved→03Open confirmed. cec exists, ccoxwell does not. [16:27:44] (03CR) 10Krinkle: svg: use rsvg-convert output flag (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [16:29:19] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:29:20] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5016.eqsin.wmnet with OS bullseye [16:29:25] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp5016.eqsin.wmnet with OS bullseye completed: - cp5016 (**PASS**) - Removed from Puppet and Pupp... [16:29:31] !log robh@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:29:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5015.eqsin.wmnet with OS bullseye [16:29:36] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp5015.eqsin.wmnet with OS bullseye completed: - cp5015 (**WARN**) - Removed from Puppet and Pupp... 
[16:29:49] !log robh@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:29:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5013.eqsin.wmnet with OS bullseye [16:29:54] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp5013.eqsin.wmnet with OS bullseye completed: - cp5013 (**WARN**) - Removed from Puppet and Pupp... [16:29:54] !log robh@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [16:29:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5014.eqsin.wmnet with OS bullseye [16:29:58] (03PS1) 10Elukey: profile::amd_gpu: add support for the K8s device plugin on DSE [puppet] - 10https://gerrit.wikimedia.org/r/912336 (https://phabricator.wikimedia.org/T333009) [16:29:59] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp5014.eqsin.wmnet with OS bullseye completed: - cp5014 (**WARN**) - Removed from Puppet and Pupp... [16:31:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:31:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "double checked every single instance listed in compiler output as using this. no change of the path anywhere.. 
either it stays default or " [puppet] - 10https://gerrit.wikimedia.org/r/909788 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [16:31:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40895/console" [puppet] - 10https://gerrit.wikimedia.org/r/912336 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [16:32:08] (03CR) 10Elukey: [C: 03+2] profile::amd_gpu: add support for the K8s device plugin on DSE [puppet] - 10https://gerrit.wikimedia.org/r/912336 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [16:32:54] (03PS12) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [16:33:10] (03CR) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [16:33:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:33:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2002.codfw.wmnet with OS bullseye [16:33:46] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye completed: - sretest2002 (**WARN**) - Removed from Puppet and PuppetDB if... [16:34:32] thanks mutante for catching that [16:34:46] did you fix data.yaml too or should I? 
[16:34:55] (as the task was reopened) [16:35:12] marostegui: it still needs the fix in data.yaml [16:35:26] reopened so we don't forget that [16:35:33] ok let me fix it [16:35:36] thank you [16:36:30] just wanted to get something else done first I had started. ty [16:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:36:38] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp5013 [16:38:01] 10SRE-tools, 10Infrastructure-Foundations: Create an offline cookbook to take care of additional offline steps - https://phabricator.wikimedia.org/T335431 (10Volans) p:05Triage→03Medium [16:38:14] 10SRE-tools, 10Infrastructure-Foundations: Create an offline cookbook to take care of additional offline steps - https://phabricator.wikimedia.org/T335431 (10Volans) [16:38:18] marostegui: I did a change that could look scary to the casual observer. 
changing the default datadir for mariadb::generic_server :) But I want to emphasize it has nothing to do with prod mariadb :) [16:38:18] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10Volans) [16:38:33] but I was just imagining how bad that could be if it was, heh [16:43:04] (03PS1) 10DCausse: labtestwiki: disable cirrus completion index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912337 [16:44:51] !log robh@cumin1001 START - Cookbook sre.dns.netbox [16:46:48] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5013 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [16:47:33] mutante: haha ok ok [16:48:15] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5013 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [16:48:15] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:48:16] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp5013 [16:48:20] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp5013` - cp5013 (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physical host - /... 
[16:50:54] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp5014 [16:51:56] PROBLEM - IPMI Sensor Status on aqs2008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:52:30] (03PS1) 10Marostegui: data.yaml: Fix uid [puppet] - 10https://gerrit.wikimedia.org/r/912339 (https://phabricator.wikimedia.org/T335150) [16:52:38] mutante: how does that look? ^ [16:53:35] (03PS1) 10Dwisehaupt: Add mappings for new frbast and payments-listener hosts [dns] - 10https://gerrit.wikimedia.org/r/912340 (https://phabricator.wikimedia.org/T319460) [16:53:58] PROBLEM - IPMI Sensor Status on mw2330 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:55:25] (03CR) 10Volans: [C: 03+1] "Thanks for the addition, LGTM! But leaving a final wording to the current maintainers of the package" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [16:56:32] !log robh@cumin1001 START - Cookbook sre.dns.netbox [16:56:52] mutante: If that looks good to you, can you +2 and merge? I need to run now :( [16:57:18] marostegui: +1.. I mean +2, I'll merge :) cu! 
[16:57:25] (03CR) 10Dzahn: [C: 03+2] "uid: cec" [puppet] - 10https://gerrit.wikimedia.org/r/912339 (https://phabricator.wikimedia.org/T335150) (owner: 10Marostegui) [16:57:43] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:57:50] mutante: thanks a lot :* [16:58:02] marostegui: yw:) run!:) [16:58:12] (03PS1) 10Herron: kafkamon: add bullseye role and node assignments [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) [16:58:37] (03CR) 10CI reject: [V: 04-1] kafkamon: add bullseye role and node assignments [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [16:58:46] claime: I wonder if that keyholder alert is expected.. but I guess the inactive deploy server should not have keys armed [16:59:08] mutante: I think it's related to m.oritzm's reboot this morning [16:59:21] (03PS2) 10Herron: kafkamon: add bullseye role and node assignments [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) [16:59:23] aha, yea, that makes sense [16:59:24] And not the switchover in itself [16:59:49] wondering if they should or should not be armed on the inactive deploy server [16:59:57] not seems safe [17:00:03] but alerts [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T1700) [17:00:10] (03CR) 10Volans: [C: 03+1] "Seems ok to me, but I've not tested it. We should also test it in the cloud realm to make sure it doesn't break that use case." 
[puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [17:00:53] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Ccoxwell - https://phabricator.wikimedia.org/T335150 (10Dzahn) 05Open→03Resolved [17:02:39] PROBLEM - IPMI Sensor Status on mw2331 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:04:36] (03CR) 10Jgreen: [C: 03+2] Add mappings for new frbast and payments-listener hosts [dns] - 10https://gerrit.wikimedia.org/r/912340 (https://phabricator.wikimedia.org/T319460) (owner: 10Dwisehaupt) [17:05:42] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10jbond) [17:06:43] (03PS3) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [17:07:07] (03CR) 10CI reject: [V: 04-1] planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [17:08:32] (03CR) 10BryanDavis: [C: 03+1] OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: 10Gergő Tisza) [17:09:52] (03PS1) 10David Caro: prometheus::blackbox::check::http: allow passing alert data [puppet] - 10https://gerrit.wikimedia.org/r/912342 [17:10:36] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10jbond) [17:11:18] 10SRE: keyholder monitoring should not alert on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) [17:11:35] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5014 decommissioned, removing all 
IPs except the asset tag one - robh@cumin1001" [17:12:53] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5014 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [17:12:53] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:53] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp5014 [17:12:59] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp5014` - cp5014 (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physical host - /... [17:13:03] 10SRE: keyholder monitoring should not alert on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) also for discussion whether the keyholder should or should not be armed on the inactive deployment server. 
[17:13:27] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp5015 [17:14:09] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 201.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [17:14:47] 10SRE: keyholder monitoring should not alert on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) [17:15:23] 10SRE, 10serviceops: keyholder monitoring should not alert on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) [17:17:07] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10jbond) [17:17:53] 10SRE, 10serviceops: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) [17:18:00] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:19:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 201.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [17:19:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) [17:19:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) Hope that the above is in order. 
Please let me know if I need to do anything on my end 🙏 [17:19:39] (03CR) 10David Caro: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [17:20:10] (03PS4) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [17:20:43] (03PS5) 10David Caro: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [17:21:27] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5015 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [17:22:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36351 [17:23:04] (03CR) 10CI reject: [V: 04-1] profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [17:23:41] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5015 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [17:23:41] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:41] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp5015 [17:23:47] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp5015` - cp5015 (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physical host - /... 
[17:24:40] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp5016 [17:24:52] (03PS3) 10Eevans: cassandra_dev: Upgrade cluster to 'dev' version (3.11.14) [puppet] - 10https://gerrit.wikimedia.org/r/911934 (https://phabricator.wikimedia.org/T335383) [17:26:40] (03CR) 10Eevans: [C: 03+2] cassandra_dev: Upgrade cluster to 'dev' version (3.11.14) [puppet] - 10https://gerrit.wikimedia.org/r/911934 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [17:29:19] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:29:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Dzahn) Hi @AndrewTavis_WMDE I think we need a little clarification here since just "analytics" doesn't exist as a group for users but there are multiple "analytics-*" groups. Can yo... [17:30:37] (03PS7) 10Jbond: git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 [17:30:49] 10SRE, 10SRE-Access-Requests: Requesting access to bast1003.wikimedia.org, mwmaint1002.eqiad.wmnet, and mwmaint2002.codfw.wmnet for erayfield - https://phabricator.wikimedia.org/T335438 (10ERayfield) [17:31:11] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5016 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [17:31:32] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [17:32:44] 10SRE, 10SRE-Access-Requests: Requesting access to bast1003.wikimedia.org, mwmaint1002.eqiad.wmnet, and mwmaint2002.codfw.wmnet for erayfield - https://phabricator.wikimedia.org/T335438 (10SCherukuwada) Manager approves. 
[17:32:56] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:33:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/912342 (owner: 10David Caro) [17:34:34] (03PS1) 10BryanDavis: tcl86: switch base image to bullseye [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912343 (https://phabricator.wikimedia.org/T335420) [17:35:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36351 [17:35:57] 10SRE, 10SRE-Access-Requests: Requesting access to bast1003.wikimedia.org, mwmaint1002.eqiad.wmnet, and mwmaint2002.codfw.wmnet for erayfield - https://phabricator.wikimedia.org/T335438 (10Dzahn) Hi @ERayfield and SRE on clinic duty, I'd say the right group here is the one called "restricted". ` restricte... [17:37:11] (03CR) 10Majavah: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [17:37:38] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5016 decommissioned, removing all IPs except the asset tag one - robh@cumin1001" [17:37:38] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:37:39] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp5016 [17:37:43] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp5016` - cp5016 (**FAIL**) - Downtimed host on Icinga/Alertmanager - Found physical host - /... 
[17:46:22] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Dzahn) [17:47:31] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10wiki_willy) Thanks @fgiunchedi ! >>! In T334785#8806834, @fgiunchedi wrote: > ACLs updated, and I'm optimistically resolving this task (and related to mgmt in PoPs) [17:47:43] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Dzahn) This will need approval from @thcipriani as group approver for "restricted". [17:52:01] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10thcipriani) >>! In T335438#8808626, @Dzahn wrote: > This will need approval from @thcipriani as group approver for "restricted". +1 `restricted` seems like the right gro... [17:52:56] (03PS2) 10BryanDavis: tcl86: switch base image to bullseye [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912343 (https://phabricator.wikimedia.org/T335420) [17:53:00] (03PS2) 10BryanDavis: Remove jessie and stretch image configuration [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 [17:56:05] (03CR) 10BryanDavis: Remove jessie and stretch image configuration (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 (owner: 10BryanDavis) [18:00:05] jeena and jnuche: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T1800). [18:00:05] jeena and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T1800) [18:02:41] (03CR) 10Iniquity: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/911799 (https://phabricator.wikimedia.org/T335136) (owner: 10Iniquity) [18:02:55] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912344 (https://phabricator.wikimedia.org/T330212) [18:02:57] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912344 (https://phabricator.wikimedia.org/T330212) (owner: 10TrainBranchBot) [18:03:34] (03PS3) 10Iniquity: Switch on creating Babel categories in Russian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911799 (https://phabricator.wikimedia.org/T335136) [18:03:50] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912344 (https://phabricator.wikimedia.org/T330212) (owner: 10TrainBranchBot) [18:03:59] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10wiki_willy) a:03RobH @RobH - just a heads up, this should be fixed via https://phabricator.wikimedia.org/T334785#8806688. 
[18:07:00] (PowerSupply) firing: Power Supply - Status - issue on mw2330:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2330 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:10:23] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.6 refs T330212 [18:10:28] T330212: 1.41.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T330212 [18:14:38] PROBLEM - Check systemd state on mw2325 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:00] (PowerSupply) firing: Power Supply - Status - issue on aqs2008:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=aqs2008 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:15:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:19] (03PS1) 10Eevans: cassandra-dev: enable prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/912347 [18:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:16:27] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.6 refs T330212 (duration: 06m 
04s) [18:16:32] T330212: 1.41.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T330212 [18:17:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2330:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2330 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:17:24] Anyone doing something on centrallog? [18:19:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2331:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2331 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:20:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:30] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:33:36] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@ba52b43]: replace python env deployment method with conda env from gitlab [18:34:00] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10Volans) [18:34:00] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@ba52b43]: replace python 
env deployment method with conda env from gitlab (duration: 00m 24s) [18:34:53] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:29] (03PS1) 10BCornwall: pybal: Switch esams LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) [18:58:28] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.824% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:59:59] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40896/console" [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [19:01:18] (03PS2) 10BCornwall: pybal: Switch esams LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) [19:02:17] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40897/console" [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [19:10:11] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@5f2ec35]: repoint shebang lines of conda env [19:10:35] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@5f2ec35]: repoint shebang lines of conda env (duration: 00m 23s) [19:15:58] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@eb07d71]: fetch_conda: path globs must not be quoted [19:16:18] 10SRE-Access-Requests: Add user xcollazo to archiva-deployers LDAP group - 
https://phabricator.wikimedia.org/T335445 (10xcollazo) [19:16:26] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@eb07d71]: fetch_conda: path globs must not be quoted (duration: 00m 27s) [19:22:35] (03PS4) 10Ebernhardson: search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) [19:25:12] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T335403 (10Jclark-ctr) a:03Jclark-ctr [19:28:57] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BPirkle) The old image suggestions api ([[ https://gerrit.wikimedia.org/g/mediawiki/services/image-suggestion-api | mediawiki/services/image-suggesti... [19:30:32] 10SRE-Access-Requests: Add user xcollazo to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T335445 (10odimitrijevic) Yes, confirming the above. I approve the request. [19:31:45] (03PS3) 10BCornwall: pybal: Switch ulsfo LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) [19:33:19] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40898/console" [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [19:36:28] (03CR) 10BBlack: [C: 03+1] pybal: Switch ulsfo LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [19:37:05] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) >>! In T288629#8807158, @JMeybohm wrote: > I don't see helm defaults being installed to releases or ci nodes since t... 
[19:47:37] !log Disable Puppet on LVS[4008-4010] for rollout of LVS maglev hashing scheduler - T263797 [19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:42] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [19:48:20] (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Switch ulsfo LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/912354 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [19:50:48] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [19:53:30] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:53:42] PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:55:02] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:56:25] (03CR) 10Herron: [C: 03+1] "I'm not super familiar with these dev clusters but lgtm overall" [puppet] - 10https://gerrit.wikimedia.org/r/912347 (owner: 10Eevans) [19:56:38] PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [19:56:56] (03PS1) 10BCornwall: wmflib: Add Maglev Hashing (mh) to supported types [puppet] - 10https://gerrit.wikimedia.org/r/912365 
(https://phabricator.wikimedia.org/T263797) [19:58:51] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40899/console" [puppet] - 10https://gerrit.wikimedia.org/r/912365 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230426T2000). [20:00:04] jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:31] (03CR) 10Cwhite: [C: 03+2] remove strict ecs version gate [puppet] - 10https://gerrit.wikimedia.org/r/906702 (owner: 10Cwhite) [20:01:33] (03PS1) 10Krinkle: speed-tests: Test selector changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912366 [20:01:43] (03PS2) 10Krinkle: speed-tests: Test selector changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912366 [20:02:52] (03CR) 10Cwhite: [V: 03+2 C: 03+2] logstash: webrequest ecs: move backend to label [puppet] - 10https://gerrit.wikimedia.org/r/910077 (https://phabricator.wikimedia.org/T277816) (owner: 10Cwhite) [20:03:44] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10wiki_willy) a:05wiki_willy→03Papaul [20:05:15] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10wiki_willy) a:03RobH [20:05:32] (03PS1) 10BCornwall: ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] - 10https://gerrit.wikimedia.org/r/912367 (https://phabricator.wikimedia.org/T263797) [20:05:42] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335294 (10wiki_willy) a:03RobH [20:11:18] 
Amir1: [20:11:37] Good evening what terrible thing I have done [20:12:14] lol sorry typo! [20:12:36] :D [20:13:16] (03CR) 10Ebernhardson: [C: 03+1] labtestwiki: disable cirrus completion index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912337 (owner: 10DCausse) [20:14:37] That’s me picking up my phone and realizing I’m late for the back port window, but that I’d still like to back port after everyone’s done :P [20:15:50] RECOVERY - pybal on lvs4010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:20:34] PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:22:35] (03PS1) 10BCornwall: ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912369 (https://phabricator.wikimedia.org/T263797) [20:23:22] Well looks like I'm the only one with a patch to backport [20:23:31] jan_drewniak: I can backport if you need [20:24:01] jeena: that would be great! 
[20:24:48] It's just a config change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/911952/ [20:25:37] okay cool [20:26:41] (03Abandoned) 10BCornwall: ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] - 10https://gerrit.wikimedia.org/r/912367 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:26:43] and deploying with my kids running around me right now is fraught with risk :P [20:26:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911952 (https://phabricator.wikimedia.org/T335311) (owner: 10Jdrewniak) [20:27:10] haha [20:27:37] (03Merged) 10jenkins-bot: Set Vector 2022 as default skin on Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911952 (https://phabricator.wikimedia.org/T335311) (owner: 10Jdrewniak) [20:28:02] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:911952|Set Vector 2022 as default skin on Polish Wikipedia (T335311)]] [20:28:07] T335311: Deploy Vector 2022 skin as the desktop default on plwiki - https://phabricator.wikimedia.org/T335311 [20:29:42] !log jhuneidi@deploy1002 jhuneidi and jdrewniak: Backport for [[gerrit:911952|Set Vector 2022 as default skin on Polish Wikipedia (T335311)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:29:57] jan_drewniak: are there any checks you need to do? [20:30:17] yeah I'll do a quick check now [20:30:56] jeena: ok looks good! 
[20:31:13] (03PS2) 10BCornwall: ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912369 (https://phabricator.wikimedia.org/T263797) [20:31:22] okay, syncing [20:32:11] (03CR) 10Cwhite: [C: 03+2] profile: ensure several dashboards plugins are absent [puppet] - 10https://gerrit.wikimedia.org/r/908884 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:34:29] (03PS3) 10BCornwall: ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912369 (https://phabricator.wikimedia.org/T263797) [20:34:31] (03PS1) 10BCornwall: Release 1.15.11 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912372 (https://phabricator.wikimedia.org/T263797) [20:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:37:00] (03CR) 10BBlack: [C: 03+1] ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912369 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:37:24] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:911952|Set Vector 2022 as default skin on Polish Wikipedia (T335311)]] (duration: 09m 22s) [20:37:29] T335311: Deploy Vector 2022 skin as the desktop default on plwiki - https://phabricator.wikimedia.org/T335311 [20:38:55] (03PS2) 10BCornwall: Release 1.15.11 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912372 (https://phabricator.wikimedia.org/T263797) [20:39:19] Backports finished! [20:39:45] jeena: woohoo thanks! 
[20:40:47] (03CR) 10BBlack: [C: 03+1] Release 1.15.11 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912372 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:41:03] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/912347 (owner: 10Eevans) [20:42:54] np! [20:43:54] (03CR) 10BCornwall: [C: 03+2] ipvs: Add Maglev Hashing (mh) scheduler type [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912369 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:43:59] (03CR) 10BCornwall: [C: 03+2] Release 1.15.11 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/912372 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:54:01] wikitech appears to be down (502, Broken pipe) [20:54:15] fun! [20:55:00] brett: can you help bring it back? :D [20:55:39] As best as my limited abilities allow, yes :) [20:55:42] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80, 10.2.2.40:7443]) https://wikitech.wikimedia.org/wiki/PyBal [20:56:40] what recently happened? [20:56:46] something is going on [20:57:09] labweb.svc.eqiad.wmnet is borked, which I'm guessing is behind wikitech [20:57:51] ebernhardson: maybe? I see a vaguely-related patch from you above, but not merged? [20:57:54] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80, 10.2.2.40:7443]) https://wikitech.wikimedia.org/wiki/PyBal [20:58:12] bblack: scap sync logged above, and i tried restarting php-fpm there manually (which might've been the cause) [20:59:01] maybe scap depooled some servers and didn't repool again? 
[20:59:16] https://config-master.wikimedia.org/pybal/eqiad/labweb says both hosts are enabled:false [20:59:26] yeah [20:59:26] * bd808 tries to see what's up on cloudweb1003 [20:59:35] I assume they're etcd-depooled [21:00:12] bblack@cumin1001:~$ confctl select service=labweb get [21:00:12] {"cloudweb1003.wikimedia.org": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=labweb,service=labweb"} [21:00:16] {"cloudweb1004.wikimedia.org": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=labweb,service=labweb"} [21:00:21] can I pool them back? [21:00:30] (03Abandoned) 10Ryan Kemper: elastic: Add wmf-elasticsearch-search-plugins package for bullseye [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/786376 (https://phabricator.wikimedia.org/T306911) (owner: 10Bking) [21:00:30] both cloudweb1003 and cloudweb1004 should be pooled as far as I know [21:00:42] being bold [21:00:43] there isn't anywhere else that the traffic would go [21:00:44] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: service=labweb [21:01:27] well, I still get a 502 from wikitech [21:01:40] bblack: can you pool labweb-ssl too? [21:01:41] If anyone needs it: https://wikitech-static.wikimedia.org/wiki/Main_Page [21:01:57] ah, there we go [21:02:05] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: service=labweb-ssl [21:02:16] wikitech's back for me [21:02:17] seems ok now, thanks [21:02:23] thank you, bblack :) [21:02:32] thanks! [21:02:42] my guess would be that the deploy process is still getting hung up on LVS confirmations, like we experienced before [21:02:53] only we don't expect it to be x-dc. We have a pybal down in ulsfo right now [21:03:08] oh :( [21:03:15] I'm just guessing, but maybe it polls the pybals globally? 
[21:03:30] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:03:31] (which seems like a very unresilient design, esp given the failure mode is silently leaving things depooled) [21:04:26] 10SRE, 10wikitech.wikimedia.org: Wikitech serving 502s - https://phabricator.wikimedia.org/T335453 (10lmata) [21:04:49] this is the issue I'm talking about, to be clear: https://phabricator.wikimedia.org/T334703 [21:04:55] 10SRE, 10wikitech.wikimedia.org: Wikitech serving 502s - https://phabricator.wikimedia.org/T335453 (10Urbanecm) This should be fixed now (thanks @bblack!) -- can you try again? [21:06:56] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:07:01] 10SRE, 10wikitech.wikimedia.org: Wikitech serving 502s - https://phabricator.wikimedia.org/T335453 (10lmata) LGTM! thank you @BBlack and @UrbanecmTest [21:07:21] 10SRE, 10wikitech.wikimedia.org: Wikitech serving 502s - https://phabricator.wikimedia.org/T335453 (10lmata) 05Open→03Resolved a:03lmata [21:08:44] 10SRE-OnFire, 10Traffic, 10conftool, 10serviceops, 10Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10BBlack) Probably needs subtasks for two things: 1. Fix "safe-service-restar... 
[21:09:42] PROBLEM - Host elastic2050 is DOWN: PING CRITICAL - Packet loss = 100% [21:10:14] RECOVERY - Host elastic2050 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [21:10:30] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:25:09] (03PS1) 10BCornwall: Revert "pybal: Switch ulsfo LVS to use Maglev scheduler" [puppet] - 10https://gerrit.wikimedia.org/r/911801 [21:27:00] (03CR) 10Eevans: [C: 03+2] cassandra-dev: enable prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/912347 (owner: 10Eevans) [21:30:22] (03CR) 10Andrea Denisse: kafkamon: add bullseye role and node assignments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [21:30:49] (03CR) 10Andrea Denisse: [C: 03+1] kafkamon: add bullseye role and node assignments [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [21:30:50] 10SRE, 10LDAP-Access-Requests: Add user xcollazo to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T335445 (10Dzahn) [21:31:09] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40902/console" [puppet] - 10https://gerrit.wikimedia.org/r/911801 (owner: 10BCornwall) [21:36:45] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Revert "pybal: Switch ulsfo LVS to use Maglev scheduler" [puppet] - 10https://gerrit.wikimedia.org/r/911801 (owner: 10BCornwall) [21:39:10] !log Re-enable Puppet on LVS[4008-4010] - T263797 [21:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:15] T263797: Switch Source Hashing 
('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [21:39:36] RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:39:42] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:39:46] RECOVERY - pybal on lvs4010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:40:08] (03PS1) 10Eevans: Add component/cassandra41 for Cassandra 4.1.x releases [puppet] - 10https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814) [21:42:06] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2050-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:42:08] (03PS5) 10Ryan Kemper: wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) [21:43:00] RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [21:43:05] (03PS1) 10Jdlrobson: Map schema should not have side effects and map marks field [extensions/Graph] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911802 (https://phabricator.wikimedia.org/T335335) [21:44:01] (03PS1) 10Jdlrobson: Don't mutate given schema in mapSchema() [extensions/Graph] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/912378 [21:45:01] (03PS1) 10Nray: Fix `a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector` bug 
[extensions/MobileFrontend] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/912379 (https://phabricator.wikimedia.org/T335451) [21:47:25] (03CR) 10Bking: [C: 03+1] wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) [21:47:32] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: try alternative slo query approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/911936 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) [21:52:06] (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2050-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:53:07] jouncebot: nowandnext [21:53:07] No deployments scheduled for the next 8 hour(s) and 6 minute(s) [21:53:07] In 8 hour(s) and 6 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0600) [21:53:07] In 8 hour(s) and 6 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0600) [21:54:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910110 (owner: 10MusikAnimal) [21:55:50] (03Merged) 10jenkins-bot: interwiki: update URL to XTools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910110 (owner: 10MusikAnimal) [21:56:20] !log samtar@deploy1002 Started scap: Backport for [[gerrit:910110|interwiki: update URL to XTools]] [21:57:30] TheresNoTime: i know it's belated, but CR-1. 
you need to update https://meta.wikimedia.org/wiki/Interwiki_map and run the generation script, otherwise your update will be overwritten the next time someone updates iw cache. [21:57:56] !log samtar@deploy1002 musikanimal and samtar: Backport for [[gerrit:910110|interwiki: update URL to XTools]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:58:23] urbanecm: ack.. [21:58:58] fwiw, `scap update-interwiki-cache` is the script i was referring to. [21:59:22] (03PS1) 10Andrea Denisse: prometheus: Add label to prometheus3002 data blocks to prevent duplication in Thanos [puppet] - 10https://gerrit.wikimedia.org/r/912381 [21:59:41] (03CR) 10CI reject: [V: 04-1] Don't mutate given schema in mapSchema() [extensions/Graph] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/912378 (owner: 10Jdlrobson) [21:59:46] (03CR) 10CI reject: [V: 04-1] prometheus: Add label to prometheus3002 data blocks to prevent duplication in Thanos [puppet] - 10https://gerrit.wikimedia.org/r/912381 (owner: 10Andrea Denisse) [22:00:03] (03CR) 10Urbanecm: "Belated CR-1. interwiki.php is updated automatically based on https://meta.wikimedia.org/wiki/Interwiki_map; manual edits are lost wheneve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910110 (owner: 10MusikAnimal) [22:00:54] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:08] urbanecm: so https://meta.wikimedia.org/wiki/Special:Diff/24942859 was "all" that was needed? 
[22:01:35] that edit is enough for the URL to be updated eventually [22:01:52] if you want immediate update, you need to run scap update-interwiki-cache, push the resulting commit to gerrit and deploy via scap backport [22:02:40] but interwiki cache update happens reasonably frequently, and as it's a non-urgent change, i think it would be fine to leave it there until someone updates cache. but, if you want to try updating out, feel free to! [22:04:28] (03PS2) 10Andrea Denisse: prometheus: Add label to prometheus3002 data blocks to prevent data duplication Bug: T335406 [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) [22:04:50] (03PS1) 10Ryan Kemper: wdqs: no longer need recording rule [puppet] - 10https://gerrit.wikimedia.org/r/912382 (https://phabricator.wikimedia.org/T328306) [22:04:52] (03CR) 10CI reject: [V: 04-1] prometheus: Add label to prometheus3002 data blocks to prevent data duplication Bug: T335406 [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [22:06:04] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:910110|interwiki: update URL to XTools]] (duration: 09m 43s) [22:06:11] (03PS3) 10Andrea Denisse: prometheus: Add label to prometheus3002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) [22:06:33] (03CR) 10CI reject: [V: 04-1] prometheus: Add label to prometheus3002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [22:07:23] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: no longer need recording rule [puppet] - 10https://gerrit.wikimedia.org/r/912382 (https://phabricator.wikimedia.org/T328306) (owner: 10Ryan Kemper) [22:08:08] (03PS4) 10Andrea Denisse: prometheus: Add label to prometheus3002 data blocks to prevent data duplication [puppet] 
- 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) [22:09:32] (03CR) 10JHathaway: "Overall a really nice refactor to centralize the configs. I have a couple of initial questions based on my perusal." [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [22:10:39] (03CR) 10Andrea Denisse: [C: 03+1] kafkamon: add bullseye role and node assignments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [22:12:28] (03CR) 10MusikAnimal: interwiki: update URL to XTools (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910110 (owner: 10MusikAnimal) [22:14:40] (03PS1) 10Andrea Denisse: prometheus: Add label to prometheus4002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912383 (https://phabricator.wikimedia.org/T335406) [22:15:00] (PowerSupply) firing: Power Supply - Status - issue on aqs2008:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=aqs2008 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:16:43] (03PS1) 10Andrea Denisse: prometheus: Add label to prometheus5002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912385 (https://phabricator.wikimedia.org/T335406) [22:17:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2330:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2330 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:19:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2331:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2331 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:22:28] (03PS1) 10Andrea Denisse: prometheus: Add label to prometheus6001 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912407 (https://phabricator.wikimedia.org/T335406) [22:24:29] (03PS1) 10Andrea Denisse: prometheus: Add label to prometheus6002 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912409 (https://phabricator.wikimedia.org/T335406) [22:32:35] (03CR) 10Dzahn: [V: 04-1 C: 04-1] "duh https://puppet-compiler.wmflabs.org/output/902513/40903/planet1002.eqiad.wmnet/change.planet1002.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [22:34:38] (03PS5) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [22:39:15] (03PS6) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [22:40:35] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) I checked and that's not an issue related to this ticket, the thumbnails... 
[22:47:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jclark-ctr) a:03Jclark-ctr [22:49:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) a:03Jclark-ctr [22:49:56] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/902513/40905/planet1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [22:51:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10Jclark-ctr) a:03Jclark-ctr [22:52:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10Jclark-ctr) a:03Jclark-ctr [23:03:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.729% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:08:50] jouncebot: nowandnext [23:08:50] No deployments scheduled for the next 6 hour(s) and 51 minute(s) [23:08:51] In 6 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0600) [23:08:51] In 6 hour(s) and 51 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0600) [23:09:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/912379 (https://phabricator.wikimedia.org/T335451) (owner: 10Nray) [23:11:50] (03CR) 10Cwhite: [C: 03+1] kafkamon: add bullseye role and node assignments [puppet] - 
https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: Herron)
[23:12:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[23:24:35] (Merged) jenkins-bot: Fix `a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector` bug [extensions/MobileFrontend] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/912379 (https://phabricator.wikimedia.org/T335451) (owner: Nray)
[23:25:02] !log zabe@deploy1002 Started scap: Backport for [[gerrit:912379|Fix `a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector` bug (T335451)]]
[23:25:08] T335451: SyntaxError: Failed to execute 'closest' on 'Element': 'a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector. - https://phabricator.wikimedia.org/T335451
[23:26:28] !log zabe@deploy1002 zabe and nray: Backport for [[gerrit:912379|Fix `a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector` bug (T335451)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[23:32:09] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:912379|Fix `a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector` bug (T335451)]] (duration: 07m 07s)
[23:32:14] T335451: SyntaxError: Failed to execute 'closest' on 'Element': 'a.image:not(.noviewer,.metadata),a.thumbimage:not(.noviewer,.metadata)' is not a valid selector.
- https://phabricator.wikimedia.org/T335451
[23:39:41] (PS1) Catrope: beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064)
[23:50:56] (PS1) Zabe: Pin wgAbuseFilterActorTableSchemaMigrationStage to _COMPAT_OLD [mediawiki-config] - https://gerrit.wikimedia.org/r/912419 (https://phabricator.wikimedia.org/T334295)
[23:50:58] (PS1) Zabe: beta: Start writing to af_actor/afh_actor [mediawiki-config] - https://gerrit.wikimedia.org/r/912420 (https://phabricator.wikimedia.org/T334295)
[23:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[23:53:14] (CR) Zabe: [C: +2] Pin wgAbuseFilterActorTableSchemaMigrationStage to _COMPAT_OLD [mediawiki-config] - https://gerrit.wikimedia.org/r/912419 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[23:53:16] (CR) Zabe: [C: +2] beta: Start writing to af_actor/afh_actor [mediawiki-config] - https://gerrit.wikimedia.org/r/912420 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[23:54:06] (Merged) jenkins-bot: Pin wgAbuseFilterActorTableSchemaMigrationStage to _COMPAT_OLD [mediawiki-config] - https://gerrit.wikimedia.org/r/912419 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[23:54:08] (Merged) jenkins-bot: beta: Start writing to af_actor/afh_actor [mediawiki-config] - https://gerrit.wikimedia.org/r/912420 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[23:54:48] !log zabe@deploy1002 Started scap: T334295
[23:54:53] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295
[23:59:01] (PS1) Krinkle: mc: Fix accidental mcrouter prefix $wgWANObjectCache on labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/912421
(https://phabricator.wikimedia.org/T329680)