[00:00:14] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a3-codfw.mgmt.codfw.wmnet [00:00:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:01:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1149.eqiad.wmnet with OS bullseye [00:01:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye [00:02:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [00:03:55] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a3-codfw - pt1979@cumin2002" [00:04:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a3-codfw - pt1979@cumin2002" [00:04:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:05:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a3-codfw.mgmt.codfw.wmnet [00:06:06] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a4-codfw.mgmt.codfw.wmnet [00:06:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:09:08] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a4-codfw - pt1979@cumin2002" [00:10:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record 
for lsw1-a4-codfw - pt1979@cumin2002" [00:10:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:10:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a4-codfw.mgmt.codfw.wmnet [00:11:07] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a5-codfw.mgmt.codfw.wmnet [00:11:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:15:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a5-codfw - pt1979@cumin2002" [00:16:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a5-codfw - pt1979@cumin2002" [00:16:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:17:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a5-codfw.mgmt.codfw.wmnet [00:18:09] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:19] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a6-codfw.mgmt.codfw.wmnet [00:18:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:21:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a6-codfw - pt1979@cumin2002" [00:22:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a6-codfw - pt1979@cumin2002" [00:22:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[00:23:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a6-codfw.mgmt.codfw.wmnet [00:23:59] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS bullseye [00:24:04] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host cassandra-dev2003.codfw.wmnet with OS bullseye completed: - cassan... [00:25:37] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a7-codfw.mgmt.codfw.wmnet [00:25:39] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:26:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [00:26:59] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [00:28:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [00:30:51] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:36:08] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [00:36:17] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - 
https://phabricator.wikimedia.org/T332180 (10Papaul) 05Open→03Resolved @cmooney @ayounsi the cabling is done. I am using port 55 on each leaf to connect to ssw1-a1 and port 54 to connect to ssw1-a8. the l... [00:36:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a7-codfw - pt1979@cumin2002" [00:37:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a7-codfw - pt1979@cumin2002" [00:37:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:38:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a7-codfw.mgmt.codfw.wmnet [00:39:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930598 [00:39:31] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930598 (owner: 10TrainBranchBot) [00:42:55] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b2-codfw.mgmt.codfw.wmnet [00:42:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:46:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b2-codfw - pt1979@cumin2002" [00:46:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b2-codfw - pt1979@cumin2002" [00:46:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:47:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device 
lsw1-b2-codfw.mgmt.codfw.wmnet [00:48:15] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Eevans) [00:48:17] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans) 05Open→03Resolved a:03Eevans Done! macro-deployed [00:48:23] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans) [00:56:52] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b3-codfw.mgmt.codfw.wmnet [00:56:53] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:57:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1149.eqiad.wmnet with OS bullseye [00:57:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye execute... 
[00:58:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/930598 (owner: 10TrainBranchBot) [00:59:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b3-codfw - pt1979@cumin2002" [01:00:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b3-codfw - pt1979@cumin2002" [01:00:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:01:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b3-codfw.mgmt.codfw.wmnet [01:01:40] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b4-codfw.mgmt.codfw.wmnet [01:01:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:04:25] (03PS1) 10Eevans: cassandra: use python3 as python [puppet] - 10https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) [01:05:18] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b4-codfw - pt1979@cumin2002" [01:06:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b4-codfw - pt1979@cumin2002" [01:06:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:07:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b4-codfw.mgmt.codfw.wmnet [01:10:50] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b5-codfw.mgmt.codfw.wmnet [01:10:52] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox 
[01:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:11:27] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:12:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:14:45] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:14:46] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b5-codfw - pt1979@cumin2002" [01:15:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b5-codfw - pt1979@cumin2002" [01:15:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:16:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:16:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b5-codfw.mgmt.codfw.wmnet [01:17:49] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b6-codfw.mgmt.codfw.wmnet [01:17:51] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:20:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b6-codfw - pt1979@cumin2002" [01:21:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b6-codfw - pt1979@cumin2002" [01:21:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:22:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b6-codfw.mgmt.codfw.wmnet [01:24:01] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b7-codfw.mgmt.codfw.wmnet [01:24:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:27:46] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b7-codfw - pt1979@cumin2002" [01:28:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b7-codfw - pt1979@cumin2002" [01:28:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:29:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b7-codfw.mgmt.codfw.wmnet [01:31:24] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-b8-codfw.mgmt.codfw.wmnet [01:31:25] !log pt1979@cumin2002 START 
- Cookbook sre.dns.netbox [01:35:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b8-codfw - pt1979@cumin2002" [01:36:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b8-codfw - pt1979@cumin2002" [01:36:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:36:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b8-codfw.mgmt.codfw.wmnet [01:42:18] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (10Papaul) [01:47:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (10Papaul) 05Open→03Resolved @cmooney @ayounsi the basic config is done on all the switches using ZTP. so when added to devices.yaml you should be go... 
[01:48:50] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T339178 (10Papaul) 05Open→03Resolved a:03Papaul [01:50:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [01:50:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [01:56:01] (03PS1) 10RLazarus: deployment_server: Add opentelemetry-collector kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/930739 (https://phabricator.wikimedia.org/T320564) [01:56:27] (03PS1) 10RLazarus: admin_ng: Add namespace for opentelemetry-collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/930740 (https://phabricator.wikimedia.org/T320564) [02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [02:14:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:15:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:18:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10nshahquinn-wmf) @MatthewVernon thank you very much! Today, I started setting up the new account and testing everything out. I copied over my o... [02:22:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:23:43] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:27:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [03:38:30] (Not accepting/receiving prefixes from anycast BGP 
peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [03:40:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:44:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:46:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:51:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:05:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:10:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:20:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:25:19] (03PS1) 10KartikMistry: Update MinT to 2023-06-16-042302-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/930743 (https://phabricator.wikimedia.org/T339271) [05:26:51] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:28:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:33:42] (03PS1) 10KartikMistry: Use Parsoid for all Wikis for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) [05:43:29] (03PS1) 10Marostegui: control-mariadb-client-10.6-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/930745 (https://phabricator.wikimedia.org/T338918) [05:43:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:44:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:52:59] 10SRE, 10Infrastructure-Foundations, 10netops: IC-307235 down yet again - https://phabricator.wikimedia.org/T339289 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks, looks like it's back up and they sent an RFO. This circuit will be upgraded to 100G soon-ish, so probably no need for specific follow up... 
[05:53:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:58:29] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.6-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/930745 (https://phabricator.wikimedia.org/T338918) (owner: 10Marostegui) [05:59:03] (03Merged) 10jenkins-bot: control-mariadb-client-10.6-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/930745 (https://phabricator.wikimedia.org/T338918) (owner: 10Marostegui) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230616T0600) [06:02:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:03:53] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:17:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:19:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:32:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:34:02] (03CR) 10Dzahn: [C: 03+1] registry: Add nginx logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/930719 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [06:42:35] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:44:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:55:37] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:59:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230616T0700) [07:00:13] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:01:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:01:55] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:49] hashar: I would like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/930689 to fix a regression in wmf.13. Is that OK with you? [07:11:15] (cc jnuche as this week's train operator) [07:11:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:12:40] (03CR) 10Elukey: changeprop: remove match on specific wiki_id for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [07:13:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:21:26] kostajh: sure :) [07:22:02] hashar: thanks! 
I've asked for SRE approval in #wikimedia-sre per https://wikitech.wikimedia.org/wiki/Deployments/Emergencies [07:22:43] (03PS1) 10Hashar: Revert "Structured tasks: Fix toolbar rewriting" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930692 (https://phabricator.wikimedia.org/T339292) [07:23:01] I guess the revert in VisualEditor got through because there are no integration tests with Kartographer [07:25:44] With GrowthExperiments, rather [07:26:01] we do have some integration tests, but nothing that is going to check if there are two publish buttons on the page :\ [07:26:24] as I usually say: shit happens ™ [07:26:41] (03CR) 10Elukey: "Should we also remove the `profile::base::remove_python2_on_bullseye: false` settings as well in hiera?" [puppet] - 10https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) (owner: 10Eevans) [07:26:46] kostajh: no objections here either [07:27:00] what problems are we seeing in prod? [07:27:20] https://phabricator.wikimedia.org/F37102410 [07:27:21] jnuche: T338934 [07:27:22] T338934: [betalabs] Duplicate Publish button for Structured tasks - https://phabricator.wikimedia.org/T338934 [07:27:31] top right has two blue "Publish changes..." buttons :] [07:27:58] (03CR) 10Hashar: [C: 03+2] Revert "Structured tasks: Fix toolbar rewriting" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930692 (https://phabricator.wikimedia.org/T339292) (owner: 10Hashar) [07:28:13] gotcha, thx [07:30:24] kostajh: patch is in the pipes, so theoretically it can be tested on beta [07:30:38] and when the wmf one is merged in I will pull it on mwdebug [07:31:18] hashar: it's not merged yet AFAICT, so it won't be on beta for a while. 
It will be faster to verify in production with mwdebug, I think [07:32:25] +1 [07:38:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:38:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [07:39:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:48:44] (03CR) 10Klausman: changeprop: remove match on specific wiki_id for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [07:50:36] (03PS1) 10Klausman: ml-services: update outlink replica counts to 3/5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) [07:50:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for acmechief2001.codfw.wmnet [07:50:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2001.codfw.wmnet [07:54:28] (03Merged) 10jenkins-bot: Revert "Structured tasks: Fix toolbar rewriting" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930692 (https://phabricator.wikimedia.org/T339292) (owner: 10Hashar) [07:58:15] kostajh: I am running the backport [07:58:39] hashar: is it on mwdebug? 
[07:58:41] !log hashar@deploy1002 Started scap: Backport for [[gerrit:930692|Revert "Structured tasks: Fix toolbar rewriting" (T339292 T338934)]] [07:58:46] T339292: Issues with gadgets adding tools to VisualEditor "Page options" dropdown (ve.init.Target.actionGroups[1] is undefined) - https://phabricator.wikimedia.org/T339292 [07:58:46] T338934: [betalabs] Duplicate Publish button for Structured tasks - https://phabricator.wikimedia.org/T338934 [07:58:51] I have just started [07:58:57] (03CR) 10Elukey: ml-services: update outlink replica counts to 3/5 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) (owner: 10Klausman) [08:00:05] !log hashar@deploy1002 hashar: Backport for [[gerrit:930692|Revert "Structured tasks: Fix toolbar rewriting" (T339292 T338934)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [08:00:28] (03PS2) 10Klausman: ml-services: update outlink replica counts [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) [08:00:34] kostajh: it is on mwdebug hosts [08:00:43] ok, looking [08:00:44] (03CR) 10Klausman: ml-services: update outlink replica counts (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) (owner: 10Klausman) [08:04:25] (03CR) 10DCausse: [C: 03+1] "lgtm," [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [08:05:27] hashar: functionally, lgtm. No errors showing up in any dashboards, right? 
[08:06:09] hashar: mwdebug on logstash seems OK [08:06:32] we collect javascript client side errors, then if there are any you would have seen them locally [08:07:46] deploying [08:08:09] thx [08:09:47] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:09:47] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:19] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:11:51] * hashar whistles [08:12:44] parse1002.eqiad.wmnet had a connection timeout [08:12:50] then that change is javascript only so.. [08:13:35] (03PS2) 10Ayounsi: Allow MGMT ranges to make TFTP requests to install server [puppet] - 10https://gerrit.wikimedia.org/r/930727 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:13:48] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930727 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:19:43] (03CR) 10Ayounsi: [C: 03+1] Allow MGMT ranges to make TFTP requests to install server [puppet] - 10https://gerrit.wikimedia.org/r/930727 
(https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:19:49] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:930692|Revert "Structured tasks: Fix toolbar rewriting" (T339292 T338934)]] (duration: 21m 08s) [08:19:55] T339292: Issues with gadgets adding tools to VisualEditor "Page options" dropdown (ve.init.Target.actionGroups[1] is undefined) - https://phabricator.wikimedia.org/T339292 [08:19:55] T338934: [betalabs] Duplicate Publish button for Structured tasks - https://phabricator.wikimedia.org/T338934 [08:25:58] !log akosiaris@cumin1001 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet [08:31:19] (03CR) 10Elukey: [C: 03+1] ml-services: update outlink replica counts [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) (owner: 10Klausman) [08:31:39] (03CR) 10Klausman: [C: 03+2] ml-services: update outlink replica counts [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) (owner: 10Klausman) [08:32:34] (03Merged) 10jenkins-bot: ml-services: update outlink replica counts [deployment-charts] - 10https://gerrit.wikimedia.org/r/930748 (https://phabricator.wikimedia.org/T328899) (owner: 10Klausman) [08:32:49] parse1002.eqiad.wmnet has cpu issue and has been removed from the pool / conftool [08:32:52] so should no more be an issue [08:33:08] kostajh: it is fully in production so the duplicate button should no more occur [08:33:14] nothing on mw log errors as expected [08:35:05] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:40:28] (03PS3) 10Slyngshede: C:idm::deployment switch MariaDB driver. [puppet] - 10https://gerrit.wikimedia.org/r/929951 [08:41:21] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[08:41:22] (03PS4) 10Slyngshede: C:idm::deployment switch MariaDB driver. [puppet] - 10https://gerrit.wikimedia.org/r/929951 [08:43:16] (03CR) 10Slyngshede: C:idm::deployment switch MariaDB driver. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929951 (owner: 10Slyngshede) [08:45:57] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [08:45:57] PROBLEM - Check systemd state on parse1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:29] PROBLEM - puppet last run on parse1002 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:27] RECOVERY - Check systemd state on parse1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:30] (03PS1) 10Slyngshede: Requirements: Use pure Python database driver. [software/bitu] - 10https://gerrit.wikimedia.org/r/930751 [08:47:49] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:49:04] (03CR) 10Ayounsi: [C: 03+1] "cool, thanks for the details! Stats are nice but not required." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: 10Cathal Mooney) [08:52:01] RECOVERY - puppet last run on parse1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:55:23] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10cmooney) Thanks @papaul great work! So it seems I was wrong we can run homer on these without having to add anything to devices.yaml for now. 
Just need to set... [08:55:53] 10SRE-tools, 10Infrastructure-Foundations, 10homer: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415 (10ayounsi) I'm wondering if we could look at prioritizing this work. With new network devices arriving in codfw, we're reaching the limit of configuring network devices one afte... [09:00:08] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002 - https://phabricator.wikimedia.org/T339340 (10akosiaris) [09:00:30] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) [09:01:27] (03CR) 10Cathal Mooney: [C: 03+2] Allow MGMT ranges to make TFTP requests to install server [puppet] - 10https://gerrit.wikimedia.org/r/930727 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:01:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:02:00] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [09:03:10] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:05:44] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) [09:06:02] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) Host appears functional right now, but I don't trust it to put it back into rotation. 
[09:11:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:12:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) ZTP via TFTP now tested and working fully in codfw :) [09:12:49] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:23] PROBLEM - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:15:02] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (10cmooney) Thanks @papaul! I commented back on the other task, seems I was wrong about the requirement for devices.yaml: T332180#8937991 [09:22:58] (03PS1) 10Cathal Mooney: Do not push class-of-service buffer partition to ex4300 [homer/public] - 10https://gerrit.wikimedia.org/r/930754 (https://phabricator.wikimedia.org/T284592) [09:31:35] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:02] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) a:03Ladsgroup [09:46:50] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Requirements: Use pure Python database driver. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/930751 (owner: 10Slyngshede) [09:48:51] (03PS2) 10Slyngshede: Requirements: Cleanup requirements files. [software/bitu] - 10https://gerrit.wikimedia.org/r/929960 [09:51:03] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) I accepted your req in global sysops and CN admins, added you to ops and stewards and global renamers. You have been already a member of CU. Is anything left? [09:52:15] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Requirements: Cleanup requirements files. [software/bitu] - 10https://gerrit.wikimedia.org/r/929960 (owner: 10Slyngshede) [10:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [10:18:25] (03CR) 10Hnowlan: [C: 03+2] images: log key limited by poolcounter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930664 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:22:59] (03CR) 10Ayounsi: [C: 03+1] Do not push class-of-service buffer partition to ex4300 [homer/public] - 10https://gerrit.wikimedia.org/r/930754 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [10:24:31] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930601 [10:24:33] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930602 [10:27:25] (03Merged) 10jenkins-bot: images: log key limited by poolcounter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930664 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:18] (03PS1) 10Vgutierrez: acme-chief: Fix PASSIVE_FQDN syntax [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) [10:29:44] (03CR) 10CI reject: [V: 04-1] acme-chief: Fix PASSIVE_FQDN syntax [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:51] (03PS2) 10Vgutierrez: acme-chief: Fix PASSIVE_FQDN syntax [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) [10:33:58] (03CR) 10Cathal Mooney: [C: 03+2] Do not push class-of-service buffer partition to ex4300 [homer/public] - 10https://gerrit.wikimedia.org/r/930754 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:34] (03Merged) 10jenkins-bot: Do not push class-of-service buffer partition to ex4300 [homer/public] - 10https://gerrit.wikimedia.org/r/930754 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [10:35:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41760/console" [puppet] - 10https://gerrit.wikimedia.org/r/930761 
(https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:35:50] (03CR) 10Vgutierrez: acme-chief: Fix PASSIVE_FQDN syntax [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:37:19] (03CR) 10Vgutierrez: "just for reviewing context, this config file is used by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/prod" [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:37:42] (03CR) 10Clément Goubert: [C: 03+1] deployment_server: Add opentelemetry-collector kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/930739 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [10:38:19] (03CR) 10Clément Goubert: [C: 03+1] admin_ng: Add namespace for opentelemetry-collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/930740 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [10:38:34] !log root@cumin1001:/home/ladsgroup/software2/dbtools# cat s1.dblist | grep -v "#" | while read db; do cat tables_to_check.txt | while read table index; do echo "$db.$table"; db-compare $db $table $index db1135.eqiad.wmnet:3306 db1118 db1139:3311 || break 2; done ; done (T338354) [10:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:38] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [10:40:52] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:41:13] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Fix PASSIVE_FQDN syntax [puppet] - 10https://gerrit.wikimedia.org/r/930761 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [10:42:34] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10hoo) >>! 
In T339341#8938339, @Ladsgroup wrote: > I accepted your req in global sysops and CN admins, added you to ops and stewards and global renamers. You have been... [10:46:04] (03CR) 10Clément Goubert: [C: 03+2] deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: 10Chad) [10:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [11:00:03] (03PS1) 10Slyngshede: P:IDM Failover Redis to CODFW. [puppet] - 10https://gerrit.wikimedia.org/r/930763 [11:00:26] (03PS1) 10Hnowlan: thumbor: double CPU usage, quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/930764 [11:00:41] (03PS1) 10Btullis: Revert the change in jar version for refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/930765 (https://phabricator.wikimedia.org/T335308) [11:02:34] (03CR) 10Joal: [C: 03+1] "Thanks a lot for this Ben" [puppet] - 10https://gerrit.wikimedia.org/r/930765 (https://phabricator.wikimedia.org/T335308) (owner: 10Btullis) [11:03:31] (03CR) 10Btullis: [C: 03+2] Revert the change in jar version for refine_sanitize [puppet] - 10https://gerrit.wikimedia.org/r/930765 (https://phabricator.wikimedia.org/T335308) (owner: 10Btullis) [11:06:26] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) Fixed the functionaries announce and global renamers. Mailman doesn't let me remove your account :/ let me remove both of your subscriptions and then put o... 
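Note on the db-compare loop logged at 10:38:34 (T338354): the one-liner walks every database in s1.dblist (dropping any line that contains "#"), reads table/index pairs from tables_to_check.txt, echoes "db.table", runs db-compare, and uses "|| break 2" to leave both loops at the first failing comparison. A minimal Python sketch of that control flow follows. The names run_checks and compare are hypothetical and only stand in for the shell loop and the db-compare call; the real tool's arguments and the three hosts from the logged command are not reproduced here.

```python
from typing import Callable, Iterable, List, Tuple


def run_checks(
    dblist_lines: Iterable[str],
    table_lines: Iterable[str],
    compare: Callable[[str, str, str], bool],
) -> Tuple[List[str], bool]:
    """Return (list of "db.table" labels visited, True if all comparisons passed).

    `compare` is a hypothetical stand-in for invoking db-compare; it should
    return True on success (exit code 0) and False otherwise.
    """
    # Mirror `read table index`: first word is the table, the rest is the index.
    tables = []
    for line in table_lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # simplification: the shell loop would not skip blank lines
        index = parts[1].strip() if len(parts) > 1 else ""
        tables.append((parts[0], index))

    checked: List[str] = []
    for raw in dblist_lines:
        if "#" in raw:
            continue  # mirror `grep -v "#"`: drops any line containing '#'
        db = raw.strip()
        if not db:
            continue  # simplification, as above
        for table, index in tables:
            checked.append(f"{db}.{table}")  # mirror the `echo "$db.$table"`
            if not compare(db, table, index):
                return checked, False  # mirror `|| break 2`: stop both loops
    return checked, True
```

Passing compare as a callable keeps the sketch testable without database access; in practice it could wrap a subprocess call and map a zero exit code to True.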
[11:06:59] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add device-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/930214 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [11:07:57] (03Merged) 10jenkins-bot: api-gateway: add device-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/930214 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [11:09:15] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) nope that doesn't work either. Even mass unsub doesn't work. Sigh. I could probably unsub you by directly doing DELETE in the database. [11:09:38] (03PS2) 10Hnowlan: thumbor: double CPU usage, quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/930764 [11:14:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1002.eqiad.wmnet [11:14:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:15:02] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:21:05] (03CR) 10Clément Goubert: [C: 03+1] thumbor: double CPU usage, quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/930764 (owner: 10Hnowlan) [11:21:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1002.eqiad.wmnet [11:23:28] (03CR) 10Clément Goubert: [C: 03+2] service: add comment for spicerack field addition [puppet] - 10https://gerrit.wikimedia.org/r/909605 (owner: 10Clément Goubert) [11:32:14] (03CR) 10Hnowlan: [C: 03+2] thumbor: double CPU usage, quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/930764 (owner: 10Hnowlan) [11:34:46] (03Merged) 10jenkins-bot: thumbor: double CPU usage, quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/930764 (owner: 10Hnowlan) [11:38:30] (Not accepting/receiving prefixes from anycast BGP peer) 
firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:41:29] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [11:46:31] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) new `pools.yaml` file: ` - also_notifies: [] attributes: {} description: Pool for pdns backing designate id: 794... [11:46:36] !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:47:47] !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:50:10] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) I just deleted netbox objects for IPv6: * 2620:0:860:2:208:80:153:47 * 2620:0:860:2:208:80:153:50 [11:50:33] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:53:03] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:53:32] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:53:50] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:55:40] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:56:07] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:56:12] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: codfw1dev: drop IPv6 records for DNS services [dns] - 10https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357) [11:59:02] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:00:12] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:00:38] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:01:00] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [12:01:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: codfw1dev: drop IPv6 records for DNS services [dns] - 10https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [12:02:39] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:04:50] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:12] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:08:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts parse1002.eqiad.wmnet [12:10:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:10:38] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh ns0.openstack.codfw1dev address [dns] - 10https://gerrit.wikimedia.org/r/930791 (https://phabricator.wikimedia.org/T307357) [12:10:55] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: fix delegation for codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) [12:11:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:15:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:16:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts parse1002.eqiad.wmnet [12:18:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) Dell Service Request 170238017 
[12:38:10] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [12:45:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:45:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:53:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:30] (Not accepting/receiving prefixes from anycast BGP peer) resolved: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:10:58] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10Gehel) [13:11:19] (03PS2) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [13:11:51] 10Puppet, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10Gehel) 05Open→03Resolved [13:12:58] (03PS3) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [13:15:13] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:15:56] (03CR) 10D3r1ck01: Use Parsoid for all Wikis for Content Translation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [13:16:08] (03CR) 10D3r1ck01: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [13:16:57] (03PS4) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [13:19:38] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:20:51] (03CR) 10KartikMistry: Use Parsoid for all Wikis for Content Translation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [13:21:14] (03CR) 10KartikMistry: Use Parsoid for all Wikis for Content Translation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [13:30:31] RECOVERY - WDQS SPARQL on wdqs2022 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.269 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:30:35] RECOVERY - Query Service HTTP Port on wdqs2022 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:51:10] (03CR) 10Andrew Bogott: [C: 03+2] Remove more cloud stretch support [puppet] - 10https://gerrit.wikimedia.org/r/927215 (owner: 10Muehlenhoff) [13:51:16] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet [13:51:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:52:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a1-codfw.mgmt.codfw.wmnet [13:54:35] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:55:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: hw troubleshooting [13:55:37] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 
days, 0:00:00 on parse1002.eqiad.wmnet with reason: hw troubleshooting [13:55:41] SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=449aef54-88d6-4ae4-81c0-075c731ff7c3) set by cgoubert@cumin1001 for 7 day... [14:02:34] (CR) Cathal Mooney: [C: +1] "LGTM, makes sense to do this one first I think, this IP is currently being announced / reachable via cloudservices2005-dev" [dns] - https://gerrit.wikimedia.org/r/930791 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:02:58] (PS2) Eevans: cassandra: use python3 as python [puppet] - https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) [14:03:40] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: refresh ns0.openstack.codfw1dev address [dns] - https://gerrit.wikimedia.org/r/930791 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:05:22] (CR) Eevans: cassandra: use python3 as python (1 comment) [puppet] - https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) (owner: Eevans) [14:05:24] (CR) Cathal Mooney: [C: +1] wikimediacloud.org: refresh ns0.openstack.codfw1dev address (1 comment) [dns] - https://gerrit.wikimedia.org/r/930791 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:06:02] (PS3) Eevans: cassandra: use python3 as python [puppet] - https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) [14:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:59] 
(PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:11:00] (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 [14:11:12] (PS2) Arturo Borrero Gonzalez: wikimedia.cloud: fix delegation for codfw1dev [dns] - https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) [14:11:33] (CR) Cathal Mooney: [C: +1] wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 (owner: Arturo Borrero Gonzalez) [14:11:53] (CR) CI reject: [V: -1] wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 (owner: Arturo Borrero Gonzalez) [14:12:10] (CR) CI reject: [V: -1] wikimedia.cloud: fix delegation for codfw1dev [dns] - https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:14:39] (PS2) Arturo Borrero Gonzalez: wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 [14:14:41] (PS3) Arturo Borrero Gonzalez: wikimedia.cloud: fix delegation for codfw1dev [dns] - https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) [14:15:34] (CR) CI reject: [V: -1] wikimedia.cloud: fix delegation for codfw1dev [dns] - https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:15:36] (CR) CI reject: [V: -1] wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 (owner: Arturo Borrero Gonzalez) 
[14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:32] (PS2) Albertoleoncio: Enable Extension:Translate on pt.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) [14:21:49] (CR) Urbanecm: [C: -1] Enable Extension:Translate on pt.wikisource.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: Albertoleoncio) [14:21:54] (PS3) Albertoleoncio: Enable Extension:Translate on pt.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) [14:22:28] (PS3) Arturo Borrero Gonzalez: wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 [14:22:30] (PS4) Arturo Borrero Gonzalez: wikimedia.cloud: fix delegation for codfw1dev [dns] - https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) [14:24:04] (CR) Albertoleoncio: Enable Extension:Translate on pt.wikisource.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: Albertoleoncio) [14:24:36] (CR) Cathal Mooney: [C: +1] wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 (owner: Arturo Borrero Gonzalez) [14:26:45] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: add ns.openstack.codfw1dev.wikimediacloud.org back [dns] - https://gerrit.wikimedia.org/r/930799 (owner: Arturo Borrero Gonzalez) [14:28:39] (CR) Urbanecm: [C: +1] "lgtm" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: Albertoleoncio) [14:31:44] (PS1) Andrew Bogott: Openstack envscript.yaml.erb: set OS_VOLUME_API_VERSION [puppet] - https://gerrit.wikimedia.org/r/930804 [14:32:19] (CR) Arturo Borrero Gonzalez: [C: +2] wikimedia.cloud: fix delegation for codfw1dev [dns] - https://gerrit.wikimedia.org/r/930792 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:34:36] (PS1) Arturo Borrero Gonzalez: 57.15.185.in-addr.arpa: fix delegation [dns] - https://gerrit.wikimedia.org/r/930805 (https://phabricator.wikimedia.org/T307357) [14:38:40] (CR) Cathal Mooney: [C: +1] 57.15.185.in-addr.arpa: fix delegation [dns] - https://gerrit.wikimedia.org/r/930805 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:39:15] (CR) Arturo Borrero Gonzalez: [C: +2] 57.15.185.in-addr.arpa: fix delegation [dns] - https://gerrit.wikimedia.org/r/930805 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:40:05] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [14:40:05] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [14:40:14] (CR) Elukey: [C: +1] cassandra: use python3 as python [puppet] - https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) (owner: Eevans) [14:45:24] (PS1) Arturo Borrero Gonzalez: openstack: pdns: auth: service: drop IPv6 support [puppet] - https://gerrit.wikimedia.org/r/930807 (https://phabricator.wikimedia.org/T307357) [14:45:53] (PS2) Arturo Borrero Gonzalez: wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) [14:46:49] (CR) CI reject: [V: -1] wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 
(https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:50:36] (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/930807/41761/" [puppet] - https://gerrit.wikimedia.org/r/930807 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:54:32] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [14:54:38] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Deleting DNS name from the IPv6: * 2620:0:861:2:208:80:154:148/64 https://netbox.wikimedia.org/ip... [14:56:02] (CR) Eevans: cassandra: use python3 as python (1 comment) [puppet] - https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) (owner: Eevans) [14:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:57:14] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack IPv6 - aborrero@cumin1001" [14:59:10] (PS2) AikoChou: changeprop: set wiki_id match config for outlink stream [deployment-charts] - https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) [14:59:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack IPv6 - aborrero@cumin1001" [14:59:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:51] (PS3) AikoChou: changeprop: set wiki_id match config for outlink stream [deployment-charts] - 
https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) [15:03:53] (PS3) Arturo Borrero Gonzalez: wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) [15:03:55] (CR) AikoChou: changeprop: set wiki_id match config for outlink stream (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou) [15:06:44] (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: drop unused ns-recursor[0,1] FQDNs [dns] - https://gerrit.wikimedia.org/r/930808 (https://phabricator.wikimedia.org/T307357) [15:09:46] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [15:09:53] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Deleting DNS entries for: * 208.80.153.47/32 https://netbox.wikimedia.org/ipam/ip-addresses/10716... 
[15:12:05] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack IPv6 - aborrero@cumin1001" [15:13:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack IPv6 - aborrero@cumin1001" [15:13:07] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:17] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: drop unused ns-recursor[0,1] FQDNs [dns] - https://gerrit.wikimedia.org/r/930808 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [15:15:32] (PS4) Arturo Borrero Gonzalez: wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) [15:16:33] (CR) Cathal Mooney: [C: +1] wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [15:16:48] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [15:18:30] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930807/41761/" [puppet] - https://gerrit.wikimedia.org/r/930807 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [15:21:27] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: pdns: auth: also drop IPv6 references [puppet] - https://gerrit.wikimedia.org/r/930813 [15:27:22] (PS1) Arturo Borrero Gonzalez: openstack: base: pdns: auth: service: fix ferm typo [puppet] - https://gerrit.wikimedia.org/r/930814 [15:28:49] !log bking@cumin1001 
END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [15:31:29] (PS2) Arturo Borrero Gonzalez: openstack: eqiad1: pdns: auth: also drop IPv6 references [puppet] - https://gerrit.wikimedia.org/r/930813 [15:31:31] (PS2) Arturo Borrero Gonzalez: openstack: base: pdns: auth: service: fix ferm typo [puppet] - https://gerrit.wikimedia.org/r/930814 [15:34:19] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: eqiad1: pdns: auth: also drop IPv6 references [puppet] - https://gerrit.wikimedia.org/r/930813 (owner: Arturo Borrero Gonzalez) [15:34:30] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: base: pdns: auth: service: fix ferm typo [puppet] - https://gerrit.wikimedia.org/r/930814 (owner: Arturo Borrero Gonzalez) [15:49:31] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) Updating the poo yields: ` Updating Pools Configuration **************************** An error has occurred: Traceback... [15:49:57] (CR) Andrew Bogott: [C: +1] puppet-enc: add tests for add_git_commit [puppet] - https://gerrit.wikimedia.org/r/875866 (owner: David Caro) [15:51:46] (CR) Andrew Bogott: [C: +1] "This lgtm but I've caused enough lvs alerts that I'd like someone more familiar with lvs to sign-off/merge." 
[puppet] - https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: Majavah) [15:52:31] (CR) Andrew Bogott: [C: +2] openstack: nova: restrict rebuilds to admins [puppet] - https://gerrit.wikimedia.org/r/887299 (https://phabricator.wikimedia.org/T302404) (owner: Majavah) [15:54:33] (PS1) Elukey: ml-services: update the falcon-7b's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/930816 [15:54:43] (PS2) Andrew Bogott: openstack: nova: restrict rebuilds to admins [puppet] - https://gerrit.wikimedia.org/r/887299 (https://phabricator.wikimedia.org/T302404) (owner: Majavah) [15:55:28] (CR) Elukey: [C: +2] ml-services: update the falcon-7b's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/930816 (owner: Elukey) [15:58:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:58:17] (CR) Andrew Bogott: [C: +2] openstack: nova: restrict rebuilds to admins [puppet] - https://gerrit.wikimedia.org/r/887299 (https://phabricator.wikimedia.org/T302404) (owner: Majavah) [15:59:16] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:05:40] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: make designate use cloud-private address by default [puppet] - https://gerrit.wikimedia.org/r/930817 (https://phabricator.wikimedia.org/T338778) [16:09:41] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [16:12:54] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: codfw1dev: make designate use cloud-private address by default [puppet] - https://gerrit.wikimedia.org/r/930817 (https://phabricator.wikimedia.org/T338778) (owner: Arturo Borrero Gonzalez) [16:14:08] (CR) Andrew Bogott: [C: +2] wikitech_private: convert to new array syntax [puppet] - https://gerrit.wikimedia.org/r/779860 (owner: Zabe) [16:14:46] !log Rolling reboot of codfw 
cache_upload nodes to apply Linux update for CVE-2023-1872 - T335835 [16:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:08] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: designate: fix recursor_service_name [puppet] - https://gerrit.wikimedia.org/r/930818 (https://phabricator.wikimedia.org/T307357) [16:27:41] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: more hiera adjustements [puppet] - https://gerrit.wikimedia.org/r/930820 (https://phabricator.wikimedia.org/T307357) [16:28:29] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: codfw1dev: designate: fix recursor_service_name [puppet] - https://gerrit.wikimedia.org/r/930818 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [16:28:45] (PS2) Arturo Borrero Gonzalez: openstack: codfw1dev: more hiera adjustements [puppet] - https://gerrit.wikimedia.org/r/930820 (https://phabricator.wikimedia.org/T307357) [16:29:11] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: codfw1dev: more hiera adjustements [puppet] - https://gerrit.wikimedia.org/r/930820 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [16:34:51] (CR) Herron: [C: +1] hiera: actually delete chunks from loki [puppet] - https://gerrit.wikimedia.org/r/929749 (https://phabricator.wikimedia.org/T335610) (owner: Cwhite) [16:39:26] (PS1) Arturo Borrero Gonzalez: nova-fullstack: use moder resolver hiera [puppet] - https://gerrit.wikimedia.org/r/930821 [16:39:49] (PS2) Arturo Borrero Gonzalez: nova-fullstack: use modern resolver hiera [puppet] - https://gerrit.wikimedia.org/r/930821 [16:43:39] (CR) Andrew Bogott: [C: +1] nova-fullstack: use modern resolver hiera [puppet] - https://gerrit.wikimedia.org/r/930821 (owner: Arturo Borrero Gonzalez) [16:47:56] (PS3) Arturo Borrero Gonzalez: nova-fullstack: use modern resolver hiera [puppet] - https://gerrit.wikimedia.org/r/930821 [16:50:54] 
(CR) Arturo Borrero Gonzalez: [V: -1] "PCC: https://puppet-compiler.wmflabs.org/output/930821/41764/cloudcontrol1005.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/930821 (owner: Arturo Borrero Gonzalez) [16:53:21] jouncebot: nowandnext [16:53:21] For the next 14 hour(s) and 6 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230616T0700) [16:53:21] In 14 hour(s) and 6 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230617T0700) [16:56:29] (PS1) Btullis: Add support for upgrading datahub [deployment-charts] - https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [16:57:17] (PS4) Arturo Borrero Gonzalez: nova-fullstack: stop using labsdnsconfig [puppet] - https://gerrit.wikimedia.org/r/930821 [17:01:24] (PS5) Arturo Borrero Gonzalez: nova-fullstack: stop using labsdnsconfig [puppet] - https://gerrit.wikimedia.org/r/930821 [17:03:16] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] "This PCC looks right: https://puppet-compiler.wmflabs.org/output/930821/41766/" [puppet] - https://gerrit.wikimedia.org/r/930821 (owner: Arturo Borrero Gonzalez) [17:03:24] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] nova-fullstack: stop using labsdnsconfig [puppet] - https://gerrit.wikimedia.org/r/930821 (owner: Arturo Borrero Gonzalez) [17:06:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:24] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - 
https://phabricator.wikimedia.org/T338778 (aborrero) After many changes, it still shows memcached problems. [17:08:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:05:24] (CR) Andrew Bogott: [C: +2] cinderutils: stop provisioning old filename on bookworm [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah) [18:07:15] (CR) Vivian Rook: [C: +1] magnum: Have podman limit size of logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: Andrew Bogott) [18:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [18:09:38] (CR) Andrew Bogott: [C: +2] magnum: Have podman limit size of logs [puppet] - https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: Andrew Bogott) [18:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:22:33] (CR) CDanis: [C: +1] haproxy: Add support for filter bwlim-(in|out) [puppet] - https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez) [18:26:25] (CR) Andrew Bogott: [C: +2] eqiad1 cumin master: include observer project in config [puppet] - 
https://gerrit.wikimedia.org/r/869319 (owner: Andrew Bogott) [18:32:49] (CR) CDanis: "Don't we still need to define a filter stanza ...?" [puppet] - https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez) [18:34:37] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (rook) [18:47:48] (CR) D3r1ck01: [C: +1] Use Parsoid for all Wikis for Content Translation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: KartikMistry) [18:51:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1156.eqiad.wmnet with OS bullseye [18:51:31] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye [18:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [19:03:38] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [19:03:44] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [19:15:02] (PS1) CDanis: new yubikey w/ ed25519 key [puppet] - https://gerrit.wikimedia.org/r/930859 (https://phabricator.wikimedia.org/T336769) [19:17:15] (PS1) CDanis: cdanis: new ed25519 ssh key [homer/public] - https://gerrit.wikimedia.org/r/930860 (https://phabricator.wikimedia.org/T336769) [19:39:46] (CR) Andrew Bogott: [C: +2] Openstack envscript.yaml.erb: set OS_VOLUME_API_VERSION 
[puppet] - https://gerrit.wikimedia.org/r/930804 (owner: Andrew Bogott) [19:47:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1156.eqiad.wmnet with OS bullseye [19:47:51] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye execute... [20:32:36] (PS1) Bking: wdqs: Enable profile::java [puppet] - https://gerrit.wikimedia.org/r/930870 (https://phabricator.wikimedia.org/T264181) [20:35:24] (PS2) Bking: wdqs: Enable profile::java [puppet] - https://gerrit.wikimedia.org/r/930870 (https://phabricator.wikimedia.org/T264181) [20:38:42] (CR) Gehel: [C: +1] "LGTM. Thanks for the fix!" [puppet] - https://gerrit.wikimedia.org/r/930870 (https://phabricator.wikimedia.org/T264181) (owner: Bking) [20:38:59] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/930870 (https://phabricator.wikimedia.org/T264181) (owner: Bking) [20:44:47] (CR) Gehel: [C: -1] "This should already be taken care of by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata" [puppet] - https://gerrit.wikimedia.org/r/930870 (https://phabricator.wikimedia.org/T264181) (owner: Bking) [20:45:40] (Abandoned) Bking: wdqs: Enable profile::java [puppet] - https://gerrit.wikimedia.org/r/930870 (https://phabricator.wikimedia.org/T264181) (owner: Bking) [20:52:28] Hey all - I know it’s later on a Friday, but I’d like to deploy a very slight update to a security mitigation in PS.php for T336027. It’s restricted to a handful of ja projects, and most of the logic is remaining the same. Let me know if there are any objections. 
[21:04:42] !log Finished rolling reboot of codfw cache_upload nodes to apply Linux update for CVE-2023-1872 - T335835 [21:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:38] > Yes Seconds Behind Master [21:08:40] :D [21:08:48] !log Deployed updated security mitigation for T336027 [21:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:22] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS buster [22:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [22:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:25:47] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS buster [22:29:00] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy) [22:35:30] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy) [22:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [22:58:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:07:32] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:07:34] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:08:02] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:13:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:41:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:46:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:48:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency