[00:20:45] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:28:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[00:39:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913943
[00:39:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913943 (owner: TrainBranchBot)
[00:47:08] !log restart haproxy on cp2031: T334448
[00:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:11] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448
[00:56:51] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913943 (owner: TrainBranchBot)
[01:12:42] (PS2) Andrew Bogott: wmcs-webproxy: use os-cloud section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914416 (https://phabricator.wikimedia.org/T330759)
[01:12:44] (PS2) Andrew Bogott: wmcs-wikireplica-dns: use os-cloud section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914417 (https://phabricator.wikimedia.org/T330759)
[01:12:47] (PS2) Andrew Bogott: wmcs-enc-cli: use os-cloud section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914418 (https://phabricator.wikimedia.org/T330759)
[01:12:48] (PS3) Andrew Bogott: wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759)
[01:55:17] ops-eqiad: PowerSupplyFailure
- https://phabricator.wikimedia.org/T335776 (phaultfinder)
[02:07:54] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:30] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:27] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:42:09] (PS10) Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: Fomafix)
[02:45:02] (PS1) Andrew Bogott: wmcs-spreadcheck: use clouds.yaml section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914463 (https://phabricator.wikimedia.org/T330759)
[02:45:04] (PS1) Andrew Bogott: nfs-exportd: convert to using mwopenstackclients and --os-cloud [puppet] - https://gerrit.wikimedia.org/r/914464 (https://phabricator.wikimedia.org/T330759)
[03:38:30] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[04:48:54]
(PS1) KartikMistry: Update cxserver to 2023-05-03-044244-production [deployment-charts] - https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835)
[05:32:37] (CR) Elukey: [C: +2] fastapi-app: add networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/914342 (owner: Giuseppe Lavagetto)
[05:32:52] (PS5) Elukey: ml-services: add network policies for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414)
[05:39:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Disconnecting codfw > eqiad T335267
[05:39:21] T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:39:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Disconnecting codfw > eqiad T335267
[05:40:33] !log Disconnect codfw -> eqiad replication on pc1 T335267
[05:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:36] !log Disconnect codfw -> eqiad replication on pc2 T335267
[05:40:38] !log Disconnect codfw -> eqiad replication on pc3 T335267
[05:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:44] (CR) Elukey: [C: +2] ml-services: add network policies for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[05:40:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Disconnecting codfw > eqiad T335267
[05:41:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on
pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Disconnecting codfw > eqiad T335267
[05:41:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Disconnecting codfw > eqiad T335267
[05:41:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Disconnecting codfw > eqiad T335267
[05:44:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 10 hosts with reason: Disconnecting codfw > eqiad T335267
[05:44:10] !log Disconnect codfw -> eqiad replication on x1 T335267
[05:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 10 hosts with reason: Disconnecting codfw > eqiad T335267
[05:44:28] T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:47:08] (PS1) Samwilson: Remove duplicated diff-mode selector in save dialog [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914429 (https://phabricator.wikimedia.org/T324759)
[05:48:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad T335267
[05:48:14] !log Disconnect codfw -> eqiad replication on es4 T335267
[05:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad T335267
[05:51:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad T335267
[05:51:27] !log Disconnect codfw -> eqiad replication on es5 T335267
[05:51:28] T335267: Disconnect replication codfw -> eqiad -
https://phabricator.wikimedia.org/T335267
[05:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad T335267
[05:54:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad T335267
[05:54:10] !log Disconnect codfw -> eqiad replication on s6 T335267
[05:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad T335267
[05:57:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad T335267
[05:57:46] !log Disconnect codfw -> eqiad replication on s2 T335267
[05:57:47] T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad T335267
[05:59:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 26 hosts with reason: Disconnecting codfw > eqiad T335267
[05:59:22] !log Disconnect codfw -> eqiad replication on s5 T335267
[05:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 26 hosts with reason: Disconnecting codfw > eqiad T335267
[06:00:02] ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder)
[06:00:05] Deploy window MediaWiki infrastucture (UTC early)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T0600)
[06:01:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 24 hosts with reason: Disconnecting codfw > eqiad T335267
[06:01:58] !log Disconnect codfw -> eqiad replication on s3 T335267
[06:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 24 hosts with reason: Disconnecting codfw > eqiad T335267
[06:06:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 28 hosts with reason: Disconnecting codfw > eqiad T335267
[06:06:29] T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[06:06:36] !log Disconnect codfw -> eqiad replication on s7 T335267
[06:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 28 hosts with reason: Disconnecting codfw > eqiad T335267
[06:09:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 35 hosts with reason: Disconnecting codfw > eqiad T335267
[06:09:37] !log Disconnect codfw -> eqiad replication on s4 T335267
[06:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 35 hosts with reason: Disconnecting codfw > eqiad T335267
[06:14:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 34 hosts with reason: Disconnecting codfw > eqiad T335267
[06:14:09] !log Disconnect codfw -> eqiad replication on s8 T335267
[06:14:09] T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[06:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:29] !log
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 34 hosts with reason: Disconnecting codfw > eqiad T335267
[06:19:23] (PS2) EoghanGaffney: [gitlab/runner] Add basic pool/depool commands [puppet] - https://gerrit.wikimedia.org/r/913199
[06:20:10] (PS1) Jelto: aptrepo: update gitlab-ce and gitlab-runner to 15.9 [puppet] - https://gerrit.wikimedia.org/r/914594 (https://phabricator.wikimedia.org/T335784)
[06:23:30] (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:25] (CR) Ayounsi: [C: +2] Apply black to all python files [software/netbox-extras] - https://gerrit.wikimedia.org/r/907880 (owner: Ayounsi)
[06:26:23] (Merged) jenkins-bot: Apply black to all python files [software/netbox-extras] - https://gerrit.wikimedia.org/r/907880 (owner: Ayounsi)
[06:26:25] (CR) Ayounsi: [C: +1] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors [homer/public] - https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) (owner: Ssingh)
[06:28:54] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[06:29:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[06:41:13] (PS1) Ayounsi: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597
[06:42:05] (CR) CI reject: [V: -1] Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi)
[06:43:33] (PS2) Ayounsi: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597
[06:45:27] (PS5) Ayounsi: Netbox-extra: Add bandit
and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[06:46:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 38 hosts with reason: Disconnecting codfw > eqiad T335267
[06:46:03] !log Disconnect codfw -> eqiad replication on s1 T335267
[06:46:03] T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[06:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 38 hosts with reason: Disconnecting codfw > eqiad T335267
[06:47:00] (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[06:48:23] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Marostegui) @Papaul what else do they need?
We have pasted their idrac's log
[06:48:27] (CR) Ayounsi: "Messages Found: 298 with Ib6aaba35a1aa34ac1680110a6fc265bf9b72bfb9" [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[06:50:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1117.eqiad.wmnet
[06:51:16] (PS1) Marostegui: site.pp: Decommission db1117 [puppet] - https://gerrit.wikimedia.org/r/914696 (https://phabricator.wikimedia.org/T335017)
[06:53:10] (CR) Muehlenhoff: [C: +2] Add python-all to make pybal buildable on build2001 [puppet] - https://gerrit.wikimedia.org/r/914349 (owner: Muehlenhoff)
[06:53:18] (PS2) Muehlenhoff: Add python-all to make pybal buildable on build2001 [puppet] - https://gerrit.wikimedia.org/r/914349
[06:55:42] (CR) Marostegui: [C: +2] site.pp: Decommission db1117 [puppet] - https://gerrit.wikimedia.org/r/914696 (https://phabricator.wikimedia.org/T335017) (owner: Marostegui)
[06:55:53] (CR) JMeybohm: [C: -1] "Nice!"
[deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto)
[06:56:02] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:56:39] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/914594 (https://phabricator.wikimedia.org/T335784) (owner: Jelto)
[06:57:01] (CR) Jelto: [C: +2] aptrepo: update gitlab-ce and gitlab-runner to 15.9 [puppet] - https://gerrit.wikimedia.org/r/914594 (https://phabricator.wikimedia.org/T335784) (owner: Jelto)
[06:58:01] ops-eqiad, decommission-hardware, Patch-For-Review: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (Marostegui)
[06:58:51] (PS1) Marostegui: wmnet: Replace db1117 with db1217 [dns] - https://gerrit.wikimedia.org/r/914697 (https://phabricator.wikimedia.org/T335017)
[06:59:00] (CR) Filippo Giunchedi: [C: +1] prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[06:59:49] (CR) Marostegui: [C: +2] wmnet: Replace db1117 with db1217 [dns] - https://gerrit.wikimedia.org/r/914697 (https://phabricator.wikimedia.org/T335017) (owner: Marostegui)
[07:00:01] (CR) Filippo Giunchedi: "Need to remove ./hieradata/hosts/prometheus3001.yaml too" [puppet] - https://gerrit.wikimedia.org/r/913249 (https://phabricator.wikimedia.org/T33558) (owner: Andrea Denisse)
[07:00:04] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T0700)
[07:00:05] samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:52] (CR) Filippo Giunchedi: "Need to remove ./hieradata/hosts/prometheus4001.yaml too" [puppet] - https://gerrit.wikimedia.org/r/913250 (https://phabricator.wikimedia.org/T335585) (owner: Andrea Denisse)
[07:01:03] (PS1) Marostegui: install_server: Do not reimage db1214 [puppet] - https://gerrit.wikimedia.org/r/914698
[07:01:21] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1117.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:01:28] Amir1 urbanecm taavi hullo, I'm present; is one of you deploying today?
[07:01:36] (CR) Marostegui: [C: +2] install_server: Do not reimage db1214 [puppet] - https://gerrit.wikimedia.org/r/914698 (owner: Marostegui)
[07:01:40] yep, give me a second
[07:02:13] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914429 (https://phabricator.wikimedia.org/T324759) (owner: Samwilson)
[07:02:20] no hurry :)
[07:02:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1117.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:02:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:02:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1117.eqiad.wmnet
[07:02:39] ops-eqiad, decommission-hardware, Patch-For-Review: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1117.eqiad.wmnet` - db1117.eqiad.wmnet (**WARN**) - Downtimed host on...
[07:05:10] ops-codfw, DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (Marostegui) @Jhancock.wm @Papaul let me know what do you need to make this happen? Do you need to turn the host off completely or just the idrac?
[07:07:39] (PS1) Marostegui: instances.yaml: Add db1213 (s5,s6) to dbctl [puppet] - https://gerrit.wikimedia.org/r/914699 (https://phabricator.wikimedia.org/T326669)
[07:08:52] (CR) Marostegui: [C: +2] instances.yaml: Add db1213 (s5,s6) to dbctl [puppet] - https://gerrit.wikimedia.org/r/914699 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:09:29] !log installing glibc bugfix updates from bullseye point release
[07:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1213 (s5,s6) to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P47297 and previous config saved to /var/cache/conftool/dbconfig/20230503-071046-marostegui.json
[07:10:49] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:11:44] (PS1) Marostegui: db1213: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/914700 (https://phabricator.wikimedia.org/T326669)
[07:12:12] (CR) Marostegui: [C: +2] db1213: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/914700 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:13:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 1%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47298 and previous config saved to /var/cache/conftool/dbconfig/20230503-071303-root.json
[07:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 1%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47299 and previous config saved to /var/cache/conftool/dbconfig/20230503-071313-root.json
[07:15:03] (PS1)
Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701
[07:17:48] (Merged) jenkins-bot: Remove duplicated diff-mode selector in save dialog [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914429 (https://phabricator.wikimedia.org/T324759) (owner: Samwilson)
[07:17:59] finally
[07:18:40] !log taavi@deploy1002 Started scap: Backport for [[gerrit:914429|Remove duplicated diff-mode selector in save dialog (T324759)]]
[07:18:43] T324759: Inline Diff: Add legend and tooltips - https://phabricator.wikimedia.org/T324759
[07:19:24] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:58] (PS2) Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701
[07:20:13] !log taavi@deploy1002 taavi and samwilson: Backport for [[gerrit:914429|Remove duplicated diff-mode selector in save dialog (T324759)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:20:19] samwilson: please test
[07:20:41] thanks. testing now.
[07:21:43] (PS6) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[07:22:39] (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[07:22:47] taavi: all looks good
[07:22:54] ok, syncing
[07:22:56] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[07:23:30] SRE, DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (Marostegui) I have emailed Monty about this, as it affected 11.1 too - we'll see what he says. My guess is that this is not something specific for 10.6 or 11.0 as the 10.6 hosts in codfw didn't have t...
[07:26:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[07:28:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 2%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47302 and previous config saved to /var/cache/conftool/dbconfig/20230503-072808-root.json
[07:28:12] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:28:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 2%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47303 and previous config saved to /var/cache/conftool/dbconfig/20230503-072818-root.json
[07:28:55] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:914429|Remove duplicated diff-mode selector in save dialog (T324759)]] (duration: 10m 14s)
[07:28:58] T324759: Inline Diff: Add legend and tooltips - https://phabricator.wikimedia.org/T324759
[07:29:21] deployed!
[07:29:41] thanks! :)
[07:36:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 T335011', diff saved to https://phabricator.wikimedia.org/P47304 and previous config saved to /var/cache/conftool/dbconfig/20230503-073602-root.json
[07:36:07] T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011
[07:36:33] (PS1) Marostegui: db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/914702 (https://phabricator.wikimedia.org/T335011)
[07:37:42] (CR) Marostegui: [C: +2] db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/914702 (https://phabricator.wikimedia.org/T335011) (owner: Marostegui)
[07:38:30] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 3%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47305 and previous config saved to /var/cache/conftool/dbconfig/20230503-074313-root.json
[07:43:17] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:43:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 3%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47306 and previous config saved to /var/cache/conftool/dbconfig/20230503-074323-root.json
[07:44:00] (CR) Superpes15: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[07:44:38] (CR) CI reject: [V: -1] Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[07:44:50] (PS1) Marostegui: db1118,db1110: Update notes [puppet] - https://gerrit.wikimedia.org/r/914703 (https://phabricator.wikimedia.org/T335011)
[07:45:28] (CR) Marostegui: [C: +2] db1118,db1110: Update notes [puppet] - https://gerrit.wikimedia.org/r/914703 (https://phabricator.wikimedia.org/T335011) (owner: Marostegui)
[07:46:25] (CR) Superpes15: "seems you used spaces instead of tab! please fix it" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[07:48:22] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade
[07:52:23] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi)
[07:56:14] (CR) Volans: [C: -1] "The new envs can't be run and the style one is not run by CI" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[07:56:19] (PS2) Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371
[07:56:33] (CR) CI reject: [V: -1] Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto)
[07:57:03] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[07:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 4%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47307 and previous config saved to /var/cache/conftool/dbconfig/20230503-075818-root.json
[07:58:21] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:58:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 4%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47308 and previous config saved
to /var/cache/conftool/dbconfig/20230503-075828-root.json
[08:01:18] (PS3) Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371
[08:01:28] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:32] (CR) CI reject: [V: -1] Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto)
[08:02:23] (PS4) Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371
[08:04:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[08:12:21] jouncebot: nowandnext
[08:12:21] No deployments scheduled for the next 1 hour(s) and 47 minute(s)
[08:12:21] In 1 hour(s) and 47 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000)
[08:12:51] (CR) Urbanecm: [C: +2] Personalized praise: Let mentors to skip suggestions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914426 (https://phabricator.wikimedia.org/T334300) (owner: Urbanecm)
[08:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 5%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47309 and previous config saved to /var/cache/conftool/dbconfig/20230503-081323-root.json
[08:13:27] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:13:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 5%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47310 and previous config saved to /var/cache/conftool/dbconfig/20230503-081332-root.json
[08:14:25] SRE, Traffic, serviceops, Abstract
Wikipedia team (Phase λ – Launch), HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Clement_Goubert)
[08:15:27] SRE, Traffic, serviceops, Abstract Wikipedia team (Phase λ – Launch), HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Clement_Goubert) New internal certs now include `wikifunctions.org` an...
[08:15:53] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914426 (https://phabricator.wikimedia.org/T334300) (owner: Urbanecm)
[08:16:55] (CR) JMeybohm: [C: +1] Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto)
[08:28:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 10%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47311 and previous config saved to /var/cache/conftool/dbconfig/20230503-082827-root.json
[08:28:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[08:28:32] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:28:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 10%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47312 and previous config saved to /var/cache/conftool/dbconfig/20230503-082837-root.json
[08:32:08] (Merged) jenkins-bot: Personalized praise: Let mentors to skip suggestions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914426
(https://phabricator.wikimedia.org/T334300) (owner: Urbanecm) [08:32:37] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]] [08:32:40] T334300: Personalized praise: design for skipping praiseworthy mentees - https://phabricator.wikimedia.org/T334300 [08:38:14] (PS5) Urbanecm: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) [08:38:34] (PS4) Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) [08:38:41] (PS4) Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) [08:39:29] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade [08:39:42] !log dbmaint deploy schema change on eqiad s3 with replication T335834 [08:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:45] T335834: Update cx_section_translations table - https://phabricator.wikimedia.org/T335834 [08:41:18] (PS1) Hashar: ci: daily update all git cache repositories [puppet] - https://gerrit.wikimedia.org/r/914710 [08:41:42] (CR) CI reject: [V: -1] ci: daily update all git cache repositories [puppet] - https://gerrit.wikimedia.org/r/914710 (owner: Hashar) [08:43:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 25%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47313 and previous config saved to /var/cache/conftool/dbconfig/20230503-084332-root.json [08:43:36] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:43:43] !log marostegui@cumin1001
dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 25%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47314 and previous config saved to /var/cache/conftool/dbconfig/20230503-084342-root.json [08:44:57] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [08:47:35] (PS2) Hashar: ci: daily update all git cache repositories [puppet] - https://gerrit.wikimedia.org/r/914710 [08:48:38] SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (ayounsi) > The obvious solution is to allow passing of the specific IP to use, and default to $facts['ipaddre... [08:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 50%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47315 and previous config saved to /var/cache/conftool/dbconfig/20230503-085837-root.json [08:58:41] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 50%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47316 and previous config saved to /var/cache/conftool/dbconfig/20230503-085847-root.json [08:59:45] (PS1) Hashar: ci: add a couple extensions to git cache [puppet] - https://gerrit.wikimedia.org/r/914711 [09:00:17] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]] (duration: 27m 39s) [09:00:22] T334300: Personalized praise: design for skipping praiseworthy mentees - https://phabricator.wikimedia.org/T334300 [09:01:22] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [09:01:35] !log vgutierrez@cumin1001
END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host acmechief1001.eqiad.wmnet [09:02:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [09:02:56] (CR) Subramanya Sastry: "I am going to try to get this deployed in a backport window today." [mediawiki-config] - https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: C. Scott Ananian) [09:05:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [09:05:55] marostegui: Re: T335834, we are testing on testwiki. Need some time for that. [09:05:56] T335834: Update cx_section_translations table - https://phabricator.wikimedia.org/T335834 [09:06:15] kart_: No rush, I won't probably get to wikishared till next week anyways [09:06:38] marostegui: Thanks. [09:08:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet [09:11:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]] [09:11:44] T334300: Personalized praise: design for skipping praiseworthy mentees - https://phabricator.wikimedia.org/T334300 [09:11:48] !log urbanecm@deploy1002 sync-world aborted: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]] (duration: 00m 06s) [09:12:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet [09:12:42] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [09:12:49] (CR) Giuseppe Lavagetto: [C: +2] Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto) [09:13:20] (Merged)
jenkins-bot: Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto) [09:13:29] (Merged) jenkins-bot: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) (owner: Urbanecm) [09:13:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 75%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47317 and previous config saved to /var/cache/conftool/dbconfig/20230503-091342-root.json [09:13:46] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [09:13:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 75%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47318 and previous config saved to /var/cache/conftool/dbconfig/20230503-091352-root.json [09:13:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914373|[Growth] Add GEMentorDashboardEnabledModules (T334630)]] [09:14:01] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [09:14:18] (PS1) Urbanecm: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914435 (https://phabricator.wikimedia.org/T322443) [09:14:29] (PS1) Urbanecm: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914436 (https://phabricator.wikimedia.org/T322443) [09:15:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:17:03] !log vgutierrez@cumin1001 START - Cookbook
sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [09:17:20] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:19:24] (PS1) Volans: Release v0.6.2 [software/homer/deploy] - https://gerrit.wikimedia.org/r/914716 [09:20:25] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:39] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:20:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [09:20:55] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914373|[Growth] Add GEMentorDashboardEnabledModules (T334630)]] (duration: 06m 56s) [09:20:58] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [09:21:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [09:21:44] (CR) Ayounsi: [C: +1] Release v0.6.2 [software/homer/deploy] - https://gerrit.wikimedia.org/r/914716 (owner: Volans) [09:21:48] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914436 (https://phabricator.wikimedia.org/T322443) (owner: Urbanecm) [09:21:51] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914435 (https://phabricator.wikimedia.org/T322443) (owner: Urbanecm) [09:22:49] (PS4) Giuseppe Lavagetto: recommendation-api: Switch to mw-api-int-async on k8s
[deployment-charts] - https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: Clément Goubert) [09:23:30] (SystemdUnitFailed) resolved: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:21] (CR) Volans: [C: +2] Release v0.6.2 [software/homer/deploy] - https://gerrit.wikimedia.org/r/914716 (owner: Volans) [09:24:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [09:24:26] (PS1) Marostegui: db2124: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/914717 (https://phabricator.wikimedia.org/T334650) [09:24:45] (PS5) Giuseppe Lavagetto: changeprop: make num_workers configurable for jobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [09:24:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2124.codfw.wmnet with reason: Migrating to 10.6 and rebooting [09:25:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [09:25:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2124.codfw.wmnet with reason: Migrating to 10.6 and rebooting [09:25:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2124', diff saved to https://phabricator.wikimedia.org/P47319 and previous config saved to /var/cache/conftool/dbconfig/20230503-092513-root.json [09:26:42] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [09:27:13] (CR) Marostegui: [C: +2] db2124: Migrate to 10.6 [puppet] -
https://gerrit.wikimedia.org/r/914717 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui) [09:28:19] !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.2 - volans@cumin1001 [09:28:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 100%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47320 and previous config saved to /var/cache/conftool/dbconfig/20230503-092847-root.json [09:28:50] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [09:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 100%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47321 and previous config saved to /var/cache/conftool/dbconfig/20230503-092856-root.json [09:29:37] (PS1) Marostegui: db2124: Enable notications [puppet] - https://gerrit.wikimedia.org/r/914718 [09:29:45] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [09:29:57] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.2 - volans@cumin1001 [09:30:31] SRE, Infrastructure-Foundations, Traffic-Icebox, netbox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (jbond) [09:31:24] (CR) Jbond: [C: +1] Revert "sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage" [cookbooks] - https://gerrit.wikimedia.org/r/912311 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff) [09:33:38] (CR) Marostegui: [C: +2] db2124: Enable notications [puppet] - https://gerrit.wikimedia.org/r/914718 (owner: Marostegui) [09:34:48] (CR) Jbond: [C: +1] "not tested but lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/910461
(https://phabricator.wikimedia.org/T334880) (owner: Volans) [09:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47322 and previous config saved to /var/cache/conftool/dbconfig/20230503-093503-root.json [09:35:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [09:36:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [09:36:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T335838)', diff saved to https://phabricator.wikimedia.org/P47323 and previous config saved to /var/cache/conftool/dbconfig/20230503-093606-ladsgroup.json [09:36:21] jouncebot: nowandnext [09:36:21] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [09:36:21] In 0 hour(s) and 23 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000) [09:36:35] I’d like to deploy some backports ahead of the UTC afternoon window, if that’s okay [09:36:55] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [09:38:08] No objection from me [09:38:24] (PS1) Lucas Werkmeister (WMDE): wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) [09:38:32] I'm migrating recommendation-api to mw-api-int in half an hour, but that shouldn't impact you I'd think [09:41:15] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914297
(https://phabricator.wikimedia.org/T300460) (owner: Michael Große) [09:41:21] starting with this one then ^ [09:41:33] (PS1) Alexandros Kosiaris: machinetranslation: Support configuration as env variables [deployment-charts] - https://gerrit.wikimedia.org/r/914721 (https://phabricator.wikimedia.org/T331505) [09:41:34] (gate-and-submit will take some time in case anyone wants to stop me ^^) [09:41:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T335838)', diff saved to https://phabricator.wikimedia.org/P47324 and previous config saved to /var/cache/conftool/dbconfig/20230503-094135-ladsgroup.json [09:41:35] (PS1) Alexandros Kosiaris: machinetranslation: Add people to egress [deployment-charts] - https://gerrit.wikimedia.org/r/914722 (https://phabricator.wikimedia.org/T331505) [09:42:03] huh, there’s two GrowthExperiments changes in gate-and-submit-wmf? [09:42:09] (Merged) jenkins-bot: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914436 (https://phabricator.wikimedia.org/T322443) (owner: Urbanecm) [09:42:11] (Merged) jenkins-bot: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914435 (https://phabricator.wikimedia.org/T322443) (owner: Urbanecm) [09:42:22] urbanecm: can you ping me when you’re done?
[09:42:30] SRE, Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (SLyngshede-WMF) a:SLyngshede-WMF [09:42:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914436|Personalized praise: Run convertNumber() before displaying numbers (T322443)]], [[gerrit:914435|Personalized praise: Run convertNumber() before displaying numbers (T322443)]] [09:42:43] T322443: Personalized praise: new mentor dashboard module - https://phabricator.wikimedia.org/T322443 [09:44:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 4:00:00 on db1110.eqiad.wmnet with reason: Moving to m3 T335092 [09:44:16] T335092: Move db1110 to m3 - https://phabricator.wikimedia.org/T335092 [09:44:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 4:00:00 on db1110.eqiad.wmnet with reason: Moving to m3 T335092 [09:46:33] (PS1) Marostegui: db1110: Install mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/914723 [09:47:13] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [09:47:14] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Support configuration as env variables [deployment-charts] - https://gerrit.wikimedia.org/r/914721 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris) [09:47:17] (CR) Marostegui: [C: +2] db1110: Install mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/914723 (owner: Marostegui) [09:47:21] (PS13) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [09:47:39] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [09:47:49] (CR) CI reject: [V: -1] [WIP] Support for testing a new dumps NFS share [puppet] -
https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn) [09:48:04] (Merged) jenkins-bot: machinetranslation: Support configuration as env variables [deployment-charts] - https://gerrit.wikimedia.org/r/914721 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris) [09:49:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914436|Personalized praise: Run convertNumber() before displaying numbers (T322443)]], [[gerrit:914435|Personalized praise: Run convertNumber() before displaying numbers (T322443)]] (duration: 06m 53s) [09:49:37] T322443: Personalized praise: new mentor dashboard module - https://phabricator.wikimedia.org/T322443 [09:50:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 3%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47325 and previous config saved to /var/cache/conftool/dbconfig/20230503-095008-root.json [09:50:24] Lucas_WMDE: sorry, missed your message. I'm done now. [09:50:30] np, thanks!
[09:50:37] I’m still waiting for gate-and-submit [09:50:40] ack [09:51:27] PROBLEM - Host gitlab2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:32] (PS1) Filippo Giunchedi: ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) [09:52:56] (CR) CI reject: [V: -1] ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi) [09:53:00] ^ working on gitlab2003 [09:53:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Cloning db1110 from db1217:3323 T335092 [09:53:39] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Add people to egress [deployment-charts] - https://gerrit.wikimedia.org/r/914722 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris) [09:53:40] T335092: Move db1110 to m3 - https://phabricator.wikimedia.org/T335092 [09:53:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Cloning db1110 from db1217:3323 T335092 [09:54:21] (Merged) jenkins-bot: machinetranslation: Add people to egress [deployment-charts] - https://gerrit.wikimedia.org/r/914722 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris) [09:54:49] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41004/console" [puppet] - https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi) [09:55:10] (PS2) Filippo Giunchedi: ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) [09:55:16] !log ladsgroup@cumin1001 START - Cookbook
sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [09:55:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [09:56:08] (CR) Ayounsi: Replace most .format() to f-string (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi) [09:56:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P47327 and previous config saved to /var/cache/conftool/dbconfig/20230503-095641-ladsgroup.json [09:57:07] RECOVERY - Host gitlab2003 is UP: PING OK - Packet loss = 0%, RTA = 34.59 ms [09:57:33] proxies, expected v [09:57:36] (PS1) Marostegui: mariadb: Move db1110 to m3 [puppet] - https://gerrit.wikimedia.org/r/914727 (https://phabricator.wikimedia.org/T335092) [09:57:37] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:58:10] (PS1) Elukey: conftool-data: add config for the k8s ingress for ml-staging [puppet] - https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756) [09:58:33] PROBLEM - SSH on gitlab2003 is CRITICAL: connect to address 208.80.153.52 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:58:41] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:58:42] (CR) Ayounsi: Add style checker and auto-formater to tox (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi) [09:58:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [09:58:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day,
0:00:00 on db2111.codfw.wmnet with reason: Maintenance [09:59:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T335838)', diff saved to https://phabricator.wikimedia.org/P47328 and previous config saved to /var/cache/conftool/dbconfig/20230503-095901-ladsgroup.json [10:00:01] (Merged) jenkins-bot: wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914297 (https://phabricator.wikimedia.org/T300460) (owner: Michael Große) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000) [10:00:08] (PS3) Ayounsi: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 [10:00:10] (PS3) Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 [10:00:12] (PS7) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 [10:00:17] ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder) [10:00:32] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914297|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] [10:00:37] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet [10:00:42] T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962 [10:00:43] T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460 [10:00:59] ok, it merged [10:01:12] I was hoping this would finish before the deploy window :/ [10:01:16] No worries [10:01:20] (CR) CI reject: [V: -1]
Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi) [10:01:41] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:02:41] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet [10:02:59] (CR) Ayounsi: [C: +2] Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi) [10:03:18] (CR) Volans: "reply inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi) [10:03:30] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [10:03:56] (Merged) jenkins-bot: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi) [10:04:17] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T335838)', diff saved to https://phabricator.wikimedia.org/P47329 and previous config saved to /var/cache/conftool/dbconfig/20230503-100420-ladsgroup.json [10:04:21] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:25] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:41] PROBLEM - Host gitlab2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:05:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47330 and previous config saved to /var/cache/conftool/dbconfig/20230503-100513-root.json
[10:05:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:05:24] haproxy alerts are expected [10:06:35] (CR) Jbond: [C: +1] "lgtm couple of minor nits" [puppet] - https://gerrit.wikimedia.org/r/914308 (owner: Muehlenhoff) [10:06:41] build-and-push-container-images is taking a while for me [10:07:24] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [10:07:27] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [10:07:33] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:07:37] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:07:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2004.codfw.wmnet [10:07:43] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:08:11] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:08:21] (PS1) Alexandros Kosiaris: machinetranslation: Fix egress dst_nets indentation [deployment-charts] - https://gerrit.wikimedia.org/r/914729 [10:09:01] (PS14) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [10:09:13] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet [10:09:36] !log cgoubert@cumin1001 END (PASS) -
Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [10:09:57] (PS4) Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 [10:09:59] (PS8) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 [10:10:03] (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Fix egress dst_nets indentation [deployment-charts] - https://gerrit.wikimedia.org/r/914729 (owner: Alexandros Kosiaris) [10:10:18] (CR) Ayounsi: Add style checker and auto-formater to tox (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi) [10:10:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:21] RECOVERY - Host gitlab2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [10:10:39] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [10:10:42] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet [10:10:46] (Merged) jenkins-bot: machinetranslation: Fix egress dst_nets indentation [deployment-charts] - https://gerrit.wikimedia.org/r/914729 (owner: Alexandros Kosiaris) [10:10:51] (CR) Jbond: Jupyterhub-conda exclude /mnt from accessible paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene) [10:11:05] (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi) [10:11:32] (CR) Volans: [C: +1] "I've not tested it
but LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914701 (owner: 10Ayounsi) [10:11:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] InitialiseSettings.php: Change termbox url for testwikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [10:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P47331 and previous config saved to /var/cache/conftool/dbconfig/20230503-101147-ladsgroup.json [10:12:30] (03PS1) 10Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) [10:12:53] ah, my build-and-push-container-images finally finished [10:12:57] (10m58s o_O) [10:13:21] (03CR) 10CI reject: [V: 04-1] Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:13:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) (owner: 10JHathaway) [10:13:46] (03CR) 10Clément Goubert: InitialiseSettings.php: Change termbox url for testwikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [10:13:49] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [10:14:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] InitialiseSettings.php: Change termbox url for testwikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [10:14:14] jouncebot: now and next [10:14:14] For the next 0 hour(s) and 45 minute(s): MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000) [10:14:41] I’m currently in the middle of a scap backport fyi [10:14:50] thank you Lucas_WMDE ! yeah was about to ask [10:15:01] I'll hold on to the graphite reboot, it can wait [10:15:06] not planning to backport any further changes after that though [10:15:09] I’ll ping you when I’m done [10:15:25] cheers [10:16:43] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on aphlict1001.eqiad.wmnet with reason: aphlict1002 is now active [10:16:57] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on aphlict1001.eqiad.wmnet with reason: aphlict1002 is now active [10:17:16] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet [10:17:35] (03CR) 10Jbond: [C: 03+1] "lgtm. fyi i did start converting theses to puppet functions[1] but need to refresh the work. perhaps doing them in puppet was to optimis" [puppet] - 10https://gerrit.wikimedia.org/r/914406 (owner: 10JHathaway) [10:17:37] (03CR) 10Ayounsi: [C: 03+2] Add style checker and auto-formater to tox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914701 (owner: 10Ayounsi) [10:18:08] (03Merged) 10jenkins-bot: Add style checker and auto-formater to tox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914701 (owner: 10Ayounsi) [10:18:20] (03CR) 10Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [10:18:24] (03CR) 10Jbond: [C: 03+1] puppet7: re-add host core [puppet] - 10https://gerrit.wikimedia.org/r/914408 (owner: 10JHathaway) [10:18:44] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab2003.wikimedia.org [10:18:46] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for 
[[gerrit:914297|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [10:18:50] T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962 [10:18:51] T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460 [10:18:51] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [10:18:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [10:19:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47332 and previous config saved to /var/cache/conftool/dbconfig/20230503-101926-ladsgroup.json [10:19:27] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [10:19:55] (03PS2) 10Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) [10:20:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47333 and previous config saved to /var/cache/conftool/dbconfig/20230503-102018-root.json [10:20:52] (03CR) 10CI reject: [V: 04-1] Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:21:13] (03PS1) 10Elukey: Add conftool and service config for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T330414) [10:21:22] !log 
akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [10:22:06] hm, it’s not really working as I would expect [10:22:46] (03PS1) 10Volans: Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/914736 [10:23:05] RECOVERY - SSH on gitlab2003 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:23:30] (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:32] (03PS2) 10Volans: Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/914736 [10:24:32] (03PS3) 10Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) [10:24:43] (03CR) 10Ayounsi: [C: 03+1] Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/914736 (owner: 10Volans) [10:24:46] wait, I think I’ve been testing the wrong API, nevermind [10:25:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:25:53] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [10:25:59] yup, works as expected when I use list=wblistentityusage instead of prop=wbentityusage [10:26:04] syncing [10:26:54] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T335838)', diff saved to https://phabricator.wikimedia.org/P47334 and previous config saved to /var/cache/conftool/dbconfig/20230503-102654-ladsgroup.json [10:27:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:27:01] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/914736 (owner: 10Volans) [10:27:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:27:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T335838)', diff saved to https://phabricator.wikimedia.org/P47335 and previous config saved to /var/cache/conftool/dbconfig/20230503-102719-ladsgroup.json [10:27:33] (03PS2) 10Elukey: conftool-data: add config for the k8s ingress for ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756) [10:27:35] (03PS2) 10Elukey: Add conftool and service config for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) [10:27:41] (03PS4) 10Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) [10:28:21] (03PS1) 10Hnowlan: admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T335681) [10:28:24] (03CR) 10Btullis: Jupyterhub-conda exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [10:28:29] (03Restored) 10Btullis: Jupyterhub-conda exclude /mnt from accessible 
paths [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:41] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:29:49] (03PS2) 10Hnowlan: admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) [10:30:19] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:32:26] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1004.eqiad.wmnet [10:32:54] (JobUnavailable) firing: (5) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:43] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [10:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T335838)', diff saved to https://phabricator.wikimedia.org/P47336 and previous config saved to /var/cache/conftool/dbconfig/20230503-103345-ladsgroup.json [10:33:57] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubestagemaster2001.codfw.wmnet [10:34:11] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes 
API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47337 and previous config saved to /var/cache/conftool/dbconfig/20230503-103433-ladsgroup.json [10:35:23] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [10:35:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47338 and previous config saved to /var/cache/conftool/dbconfig/20230503-103523-root.json [10:35:25] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914297|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] (duration: 34m 53s) [10:35:30] T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962 [10:35:31] T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460 [10:35:34] * Lucas_WMDE done [10:35:40] godog: you’re good to go as far as I’m concerned [10:36:42] Lucas_WMDE: thank you! 
appreciate it [10:38:30] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet [10:38:58] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1004.eqiad.wmnet [10:39:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2004.codfw.wmnet [10:39:19] (03CR) 10Clément Goubert: [C: 03+2] recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: 10Clément Goubert) [10:40:15] (03Merged) 10jenkins-bot: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: 10Clément Goubert) [10:40:26] !log Migrating recommendation-api staging to mw-api-int-async - T334062 [10:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:29] T334062: Migrate recommendation-api to mw-api-int - https://phabricator.wikimedia.org/T334062 [10:40:32] jouncebot: I'm loving that speedy CI [10:40:38] _joe_* [10:41:39] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [10:45:26] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2004.codfw.wmnet [10:45:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1005.eqiad.wmnet [10:45:38] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [10:45:57] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubestagemaster2001.codfw.wmnet [10:46:05] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:25] Checking [10:47:41] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:49] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [10:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P47339 and previous config saved to /var/cache/conftool/dbconfig/20230503-104851-ladsgroup.json [10:49:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [10:49:31] (03PS5) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 [10:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T335838)', diff saved to https://phabricator.wikimedia.org/P47340 and previous config saved to /var/cache/conftool/dbconfig/20230503-104939-ladsgroup.json [10:49:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:49:56] (03CR) 10CI reject: [V: 04-1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [10:49:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:50:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T335838)', diff saved to https://phabricator.wikimedia.org/P47341 and previous config saved to /var/cache/conftool/dbconfig/20230503-105004-ladsgroup.json [10:50:10] !log Migrating recommendation-api codfw to mw-api-int-async - T334062 [10:50:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:13] T334062: Migrate recommendation-api to mw-api-int - https://phabricator.wikimedia.org/T334062 [10:50:15] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [10:50:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47342 and previous config saved to /var/cache/conftool/dbconfig/20230503-105028-root.json [10:51:32] !log Migrating recommendation-api eqiad to mw-api-int-async - T334062 [10:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:38] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [10:52:10] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [10:53:43] (03PS6) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 [10:54:08] (03CR) 10CI reject: [V: 04-1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [10:55:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1110 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/914727 (https://phabricator.wikimedia.org/T335092) (owner: 10Marostegui) [10:56:04] (03PS15) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [10:56:34] (03PS1) 10Hashar: tox: do not skip missing interpreters on CI [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914746 [10:56:36] (03PS1) 10Hashar: tox: use default python for local testing [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914747 [10:56:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T335838)', diff saved to 
https://phabricator.wikimedia.org/P47343 and previous config saved to /var/cache/conftool/dbconfig/20230503-105639-ladsgroup.json [10:57:35] (03CR) 10Hashar: "I came up with that pattern in Quibble https://gerrit.wikimedia.org/r/c/integration/quibble/+/607512 so I can just `tox` locally and do n" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914747 (owner: 10Hashar) [10:57:39] (03PS1) 10EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 [10:57:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet [11:00:03] GitLab needs a short maintenance break in one hour (12:00 UTC). For around 15 minutes GitLab, GitLab CI and most probably Phabricator will not be available [11:00:30] ack [11:00:46] (03CR) 10CI reject: [V: 04-1] [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [11:00:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy[2001-2004].codfw.wmnet with reason: Reboot T335845 [11:01:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy[2001-2004].codfw.wmnet with reason: Reboot T335845 [11:01:10] (03PS1) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [11:02:04] (03CR) 10CI reject: [V: 04-1] templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [11:02:52] !log Reboot dbproxy200[1-4] [11:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:07] (03PS1) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 
(https://phabricator.wikimedia.org/T300458) [11:03:45] (03PS2) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [11:03:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P47344 and previous config saved to /var/cache/conftool/dbconfig/20230503-110357-ladsgroup.json [11:04:33] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [11:04:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [11:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47345 and previous config saved to /var/cache/conftool/dbconfig/20230503-110532-root.json [11:05:35] (03CR) 10Lucas Werkmeister (WMDE): "diffConfig looks as expected to me – testwikidatawiki true, wikidatawiki false \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [11:06:01] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10ERayfield) my bad, sorry! didn't know that the sig had not gone through - thanks @JKieserman ! 
[11:06:31] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet [11:07:09] (03PS9) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [11:07:11] (03PS1) 10Ayounsi: Fix multiple pylint inconsistencies [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914753 [11:07:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:07:50] (03PS1) 10Hashar: Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 [11:08:14] (03CR) 10CI reject: [V: 04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [11:08:30] (03CR) 10Jbond: [C: 04-1] "-1: some minor issues see inline" [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [11:08:50] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [11:09:56] (03CR) 10Ayounsi: "Messages Found: 252 with c4d44ff08b50edc6508894a0e444e4088474b335" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [11:10:28] (03CR) 10CI reject: [V: 04-1] Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar) [11:10:32] (03CR) 10Ayounsi: "down to ~250." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914753 (owner: 10Ayounsi) [11:10:47] (03PS1) 10Volans: python_deploy: set the setgid bit on the git clone [puppet] - 10https://gerrit.wikimedia.org/r/914755 [11:11:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47346 and previous config saved to /var/cache/conftool/dbconfig/20230503-111145-ladsgroup.json [11:11:49] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [11:11:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1014.eqiad.wmnet with reason: Upgrade [11:11:55] (03CR) 10Ayounsi: [C: 03+1] python_deploy: set the setgid bit on the git clone [puppet] - 10https://gerrit.wikimedia.org/r/914755 (owner: 10Volans) [11:11:57] (03PS1) 10Majavah: build: format scripts/ with black too [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756 [11:11:59] (03PS1) 10Majavah: webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 [11:12:03] (03PS1) 10Majavah: debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 [11:12:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1014.eqiad.wmnet with reason: Upgrade [11:12:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1015.eqiad.wmnet with reason: Upgrade [11:12:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:23] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1015.eqiad.wmnet with reason: Upgrade [11:12:44] (03CR) 10CI reject: [V: 04-1] webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah) [11:13:14] (03PS2) 10Majavah: webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 [11:13:16] (03PS2) 10Majavah: debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 [11:13:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1016.eqiad.wmnet with reason: Upgrade [11:13:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1016.eqiad.wmnet with reason: Upgrade [11:13:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1017.eqiad.wmnet with reason: Upgrade [11:13:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1017.eqiad.wmnet with reason: Upgrade [11:14:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:15:53] (03PS1) 10Majavah: webservice: Improve --buildservice-image help message [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 [11:17:25] (03CR) 10Volans: [C: 03+2] python_deploy: set the setgid bit on the git clone [puppet] - 10https://gerrit.wikimedia.org/r/914755 (owner: 10Volans) [11:18:16] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:19:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Decom db1111 
T335836', diff saved to https://phabricator.wikimedia.org/P47347 and previous config saved to /var/cache/conftool/dbconfig/20230503-111904-ladsgroup.json [11:19:07] T335836: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 [11:19:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T335838)', diff saved to https://phabricator.wikimedia.org/P47348 and previous config saved to /var/cache/conftool/dbconfig/20230503-111910-ladsgroup.json [11:19:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [11:19:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [11:19:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:20:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47349 and previous config saved to /var/cache/conftool/dbconfig/20230503-112037-root.json [11:20:48] (03PS1) 10Ladsgroup: conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836) [11:21:36] (03CR) 10Marostegui: [C: 03+1] conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup) [11:22:31] (03PS2) 10Ladsgroup: conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 
(https://phabricator.wikimedia.org/T335836) [11:22:36] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup) [11:23:07] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:23:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:24:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:24:33] (03PS7) 10Jbond: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:25:39] elukey: I think the uncommitted dns changes are yours... k8s-ingress-ml-staging [11:25:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [11:26:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:26:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [11:26:44] also the .83 IP was not reserved on the eqiad prefix elukey [11:26:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [11:27:17] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:27:36] (03CR) 10Jbond: jupyterhub-conda: Fix incompatibility 
with HDFS-FUSE mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [11:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Remove db1111 from dbctl T335836', diff saved to https://phabricator.wikimedia.org/P47350 and previous config saved to /var/cache/conftool/dbconfig/20230503-112812-ladsgroup.json [11:28:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [11:28:16] T335836: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 [11:28:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47351 and previous config saved to /var/cache/conftool/dbconfig/20230503-112819-ladsgroup.json [11:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T335838)', diff saved to https://phabricator.wikimedia.org/P47352 and previous config saved to /var/cache/conftool/dbconfig/20230503-112819-ladsgroup.json [11:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47353 and previous config saved to /var/cache/conftool/dbconfig/20230503-112828-ladsgroup.json [11:28:49] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:31:37] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:31:41] (03CR) 10Jbond: "pcc looks strange:" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:33:25] (03PS8) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 
10https://gerrit.wikimedia.org/r/914308 [11:34:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T335838)', diff saved to https://phabricator.wikimedia.org/P47354 and previous config saved to /var/cache/conftool/dbconfig/20230503-113441-ladsgroup.json [11:35:20] (03PS1) 10Ladsgroup: mariadb: Remove puppet entries for db1111 [puppet] - 10https://gerrit.wikimedia.org/r/914762 (https://phabricator.wikimedia.org/T335836) [11:35:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47355 and previous config saved to /var/cache/conftool/dbconfig/20230503-113524-ladsgroup.json [11:38:42] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove puppet entries for db1111 [puppet] - 10https://gerrit.wikimedia.org/r/914762 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup) [11:38:46] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2015.codfw.wmnet [11:38:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2015.codfw.wmnet [11:38:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914746 (owner: 10Hashar) [11:39:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914747 (owner: 10Hashar) [11:40:04] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:40:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:42:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:43:13] (03CR) 10Majavah: [C: 03+1] 
apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T335838)', diff saved to https://phabricator.wikimedia.org/P47356 and previous config saved to /var/cache/conftool/dbconfig/20230503-114335-ladsgroup.json [11:43:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [11:44:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [11:44:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:44:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T335838)', diff saved to https://phabricator.wikimedia.org/P47357 and previous config saved to /var/cache/conftool/dbconfig/20230503-114426-ladsgroup.json [11:44:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:44:36] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Remove puppet entries for db1111 [puppet] - 10https://gerrit.wikimedia.org/r/914762 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup) [11:44:41] (03CR) 10Jbond: "lgtm just needs the style fixing e.g. ./utils/check-style.sh" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar) [11:47:23] (03PS5) 10Slyngshede: Requisition approval functionality. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/911249 [11:49:31] (03PS1) 10Marostegui: instances.yaml: Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/914765 (https://phabricator.wikimedia.org/T335011) [11:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P47358 and previous config saved to /var/cache/conftool/dbconfig/20230503-114947-ladsgroup.json [11:49:59] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:50:30] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/914765 (https://phabricator.wikimedia.org/T335011) (owner: 10Marostegui) [11:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P47359 and previous config saved to /var/cache/conftool/dbconfig/20230503-115030-ladsgroup.json [11:50:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1110 from dbctl T335011', diff saved to https://phabricator.wikimedia.org/P47360 and previous config saved to /var/cache/conftool/dbconfig/20230503-115124-marostegui.json [11:51:27] T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 [11:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2128 (T335838)', diff saved to https://phabricator.wikimedia.org/P47361 and previous config saved to /var/cache/conftool/dbconfig/20230503-115130-ladsgroup.json [11:51:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1111.eqiad.wmnet [11:55:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:56:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [11:56:56] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [12:02:07] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1111.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [12:04:51] (03PS1) 10Alexandros Kosiaris: machinetranslation: Use 2023-05-03-104124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914767 (https://phabricator.wikimedia.org/T331505) [12:04:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P47362 and previous config saved to /var/cache/conftool/dbconfig/20230503-120453-ladsgroup.json [12:04:56] (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) [12:05:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Use 2023-05-03-104124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914767 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [12:05:37] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P47363 and previous config saved to /var/cache/conftool/dbconfig/20230503-120536-ladsgroup.json [12:06:19] (03Merged) 10jenkins-bot: machinetranslation: Use 2023-05-03-104124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914767 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [12:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1111.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [12:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:06:34] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [12:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1111.eqiad.wmnet [12:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47364 and previous config saved to /var/cache/conftool/dbconfig/20230503-120637-ladsgroup.json [12:06:52] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3321 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:07:05] (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) [12:07:24] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:08:02] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 145034 bytes in 1.933 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:08:28] jouncebot: now and next [12:08:28] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:08:28] No deployments scheduled for the next 0 hour(s) and 51 minute(s) [12:09:04] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [12:09:11] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [12:10:08] (03CR) 10Filippo Giunchedi: [C: 03+2] ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [12:10:43] phabricator seems to be down again due to gitlab being down [12:11:30] !log Removing db1111 from zarcillo T335836 [12:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] T335836: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 [12:11:58] taavi: yes that's expected, see my message from 11:00 here or in gitlab/releng channel [12:12:10] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:12:34] (KubernetesAPILatency) resolved: High Kubernetes 
API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:12:42] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:12:54] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3321 bytes in 7.224 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:13:56] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:14:02] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 144129 bytes in 2.203 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:14:42] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:14:47] (03CR) 10Muehlenhoff: [C: 03+2] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff) [12:15:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [12:16:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [12:17:19] (03CR) 10Ottomata: [C: 03+1] "I think one of the reasons for / being a readonly path for jupyter is to prevent some kind of writable access if a malicious actor was abl" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [12:17:24] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:19:31] (03CR) 10JMeybohm: "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914275 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [12:20:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T335838)', diff saved to https://phabricator.wikimedia.org/P47365 and previous config saved to /var/cache/conftool/dbconfig/20230503-122000-ladsgroup.json [12:20:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:20:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:20:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:20:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:20:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47366 and previous config saved to /var/cache/conftool/dbconfig/20230503-122040-ladsgroup.json [12:20:41] (03PS1) 10Ottomata: eventgate - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914768 (https://phabricator.wikimedia.org/T331401) [12:20:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47367 and previous config saved to /var/cache/conftool/dbconfig/20230503-122049-ladsgroup.json [12:20:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:21:06] 
(03CR) 10Andrew Bogott: [C: 03+2] Remove unused labstore code [puppet] - 10https://gerrit.wikimedia.org/r/914415 (owner: 10Andrew Bogott) [12:21:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:21:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47368 and previous config saved to /var/cache/conftool/dbconfig/20230503-122113-ladsgroup.json [12:21:21] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914768 (https://phabricator.wikimedia.org/T331401) (owner: 10Ottomata) [12:21:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47369 and previous config saved to /var/cache/conftool/dbconfig/20230503-122143-ladsgroup.json [12:22:45] (03CR) 10JMeybohm: [C: 03+1] modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [12:22:49] https://phabricator.wikimedia.org/T335836 I can't open this but I can open rest of phabricator tickets [12:22:51] (03CR) 10JMeybohm: [C: 03+1] modules: duplicate the istio ingress template for 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914306 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [12:23:06] now I can open it [12:23:08] anyway [12:23:12] (03PS2) 10JMeybohm: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) [12:23:24] Amir1: yes, there's maintenance on gitlab per jelto's comment on -sre [12:23:52] thanks [12:24:06] phabricator should recover but it seems there is some caching and it needs some time 
[12:24:09] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [12:24:24] at least all of my tasks are loading again now [12:24:25] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [12:25:07] (03PS1) 10Ottomata: eventgate-main - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914769 (https://phabricator.wikimedia.org/T331401) [12:25:29] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [12:26:27] (03CR) 10JMeybohm: [C: 03+1] Add conftool and service config for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [12:26:44] (03CR) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [12:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47370 and previous config saved to /var/cache/conftool/dbconfig/20230503-122705-ladsgroup.json [12:27:50] (03CR) 10JMeybohm: [C: 03+1] Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [12:28:04] (03CR) 10Ottomata: [C: 03+2] eventgate-main - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914769 (https://phabricator.wikimedia.org/T331401) (owner: 10Ottomata) [12:28:11] 10ops-eqiad, 10decommission-hardware: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 (10Ladsgroup) Awesome. Thanks! 
[12:28:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [12:30:13] (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756 (owner: 10Majavah) [12:30:33] (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah) [12:31:05] (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 (owner: 10Majavah) [12:31:17] jouncebot: next [12:31:17] In 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1300) [12:31:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47371 and previous config saved to /var/cache/conftool/dbconfig/20230503-123137-ladsgroup.json [12:31:38] (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 (owner: 10Majavah) [12:31:39] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [12:32:09] (03PS3) 10Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) [12:32:13] (03CR) 10Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [12:32:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:33:47] (03CR) 10Majavah: [C: 03+2] build: format scripts/ with black too [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/914756 (owner: 10Majavah) [12:33:52] (03CR) 10Majavah: [C: 03+2] webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah) [12:33:56] (03CR) 10Majavah: [C: 03+2] debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 (owner: 10Majavah) [12:33:58] (03CR) 10Majavah: [C: 03+2] webservice: Improve --buildservice-image help message [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 (owner: 10Majavah) [12:34:12] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:34:32] (03Merged) 10jenkins-bot: build: format scripts/ with black too [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756 (owner: 10Majavah) [12:34:35] (03Merged) 10jenkins-bot: webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah) [12:34:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:35:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [12:35:41] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [12:35:42] (03Merged) 10jenkins-bot: debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 (owner: 10Majavah) [12:35:46] (03Merged) 10jenkins-bot: webservice: Improve --buildservice-image help message [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 (owner: 10Majavah) [12:36:28] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on 
A:gitlab-runner [12:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T335838)', diff saved to https://phabricator.wikimedia.org/P47372 and previous config saved to /var/cache/conftool/dbconfig/20230503-123649-ladsgroup.json [12:36:54] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [12:37:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [12:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47373 and previous config saved to /var/cache/conftool/dbconfig/20230503-123714-ladsgroup.json [12:37:34] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:37:54] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [12:38:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47374 and previous config saved to /var/cache/conftool/dbconfig/20230503-123837-ladsgroup.json [12:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P47375 and previous config saved to /var/cache/conftool/dbconfig/20230503-124212-ladsgroup.json [12:44:14] (03CR) 10Ottomata: [V: 03+2 C: 03+2] "k8s node restarts are happening in codfw now so I have to wait a bit to deploy this..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/914768 (https://phabricator.wikimedia.org/T331401) (owner: 10Ottomata) [12:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47376 and previous config saved to /var/cache/conftool/dbconfig/20230503-124558-ladsgroup.json [12:46:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P47377 and previous config saved to /var/cache/conftool/dbconfig/20230503-124643-ladsgroup.json [12:47:19] (03PS1) 10Andrew Bogott: toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 [12:47:31] (03PS6) 10Slyngshede: Requisition approval functionality. [software/bitu] - 10https://gerrit.wikimedia.org/r/911249 [12:47:46] (03CR) 10Kamila Součková: "LGTM, but I don't have enough context to actually feel okay +1'ing this '^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [12:48:34] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) [12:48:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 (owner: 10Andrew Bogott) [12:49:02] (03CR) 10David Caro: [C: 03+1] toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 (owner: 10Andrew Bogott) [12:49:52] (03CR) 10Andrew Bogott: [C: 03+2] toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 (owner: 10Andrew Bogott) [12:50:49] (03CR) 10CI reject: [V: 04-1] cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo 
Borrero Gonzalez) [12:53:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove router access for cmjohnson [homer/public] - 10https://gerrit.wikimedia.org/r/914260 (owner: 10Muehlenhoff) [12:55:44] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-dns-floating-ip-updater: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914412 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [12:55:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [12:57:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P47378 and previous config saved to /var/cache/conftool/dbconfig/20230503-125718-ladsgroup.json [12:58:14] (03PS2) 10Jbond: get_config: add specific get_config script for puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/912949 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1300). [13:00:05] MichaelG_WMDE, subbu, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:16] o/ [13:00:18] I can deploy! 
[13:00:34] hi [13:01:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P47379 and previous config saved to /var/cache/conftool/dbconfig/20230503-130105-ladsgroup.json [13:01:17] (03PS2) 10Lucas Werkmeister (WMDE): testwikidatawiki: enable entity labels in parsed API edit summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: 10Michael Große) [13:01:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: 10Michael Große) [13:01:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P47380 and previous config saved to /var/cache/conftool/dbconfig/20230503-130149-ladsgroup.json [13:02:27] (03Merged) 10jenkins-bot: testwikidatawiki: enable entity labels in parsed API edit summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: 10Michael Große) [13:02:56] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:912309|testwikidatawiki: enable entity labels in parsed API edit summaries (T335098)]] [13:02:59] T335098: Testwikidata: enable entity labels in parsed edit summaries in API requests - https://phabricator.wikimedia.org/T335098 [13:04:53] (03PS1) 10Jaime Nuche: beta: delete old files regularly from Puppet client bucket on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/914777 [13:05:00] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:912309|testwikidatawiki: enable entity labels in parsed API edit summaries (T335098)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, 
mwdebug2001.codfw.wmnet [13:05:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s start the gate-and-submit already" [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) (owner: 10Lucas Werkmeister (WMDE)) [13:05:59] hm, I don’t see a difference on https://test.wikidata.org/w/api.php?action=query&format=json&list=recentchanges&formatversion=2&rcnamespace=0&rcprop=parsedcomment [13:06:19] (03CR) 10Jbond: [C: 03+2] get_config: add specific get_config script for puppet7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912949 (owner: 10Jbond) [13:06:30] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd.py: remove some dead code [puppet] - 10https://gerrit.wikimedia.org/r/913962 (owner: 10Andrew Bogott) [13:07:17] ah, but on https://test.wikidata.org/w/api.php?action=query&format=json&prop=revisions&revids=636981&formatversion=2&rvprop=comment|parsedcomment it works [13:07:33] does list=recentchanges not work the same way? 
o_O [13:07:37] but good to deploy for now, I think [13:07:39] syncing [13:07:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:08:31] (03CR) 10Muehlenhoff: [C: 03+2] Use signed-by to in apt::package_from_component on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [13:08:55] can confirm that it works on the debug server [13:09:14] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [13:09:25] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [13:09:36] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [13:09:53] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) [13:10:48] (03PS3) 10Hashar: Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) [13:11:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [13:12:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47381 and previous config saved to /var/cache/conftool/dbconfig/20230503-131224-ladsgroup.json [13:12:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2169.codfw.wmnet with reason: Maintenance [13:12:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47382 and previous config saved to /var/cache/conftool/dbconfig/20230503-131249-ladsgroup.json [13:12:58] (03PS1) 10Ottomata: page_content_change - bump image to v1.15.0-dev0 to debug OOM [deployment-charts] - 10https://gerrit.wikimedia.org/r/914780 (https://phabricator.wikimedia.org/T332948) [13:13:28] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [13:13:36] (03CR) 10Clément Goubert: [C: 04-1] "Holding for discussion on whether merging staging-test and staging is good idea." [deployment-charts] - 10https://gerrit.wikimedia.org/r/914275 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:13:43] (03CR) 10Ottomata: [C: 03+2] page_content_change - bump image to v1.15.0-dev0 to debug OOM [deployment-charts] - 10https://gerrit.wikimedia.org/r/914780 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata) [13:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T335838)', diff saved to https://phabricator.wikimedia.org/P47383 and previous config saved to /var/cache/conftool/dbconfig/20230503-131414-ladsgroup.json [13:16:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P47384 and previous config saved to /var/cache/conftool/dbconfig/20230503-131611-ladsgroup.json [13:16:17] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:16:27] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:16:52] “Finished Running helmfile -e codfw --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 07m 26s) ” o_O [13:16:54] 7½ minutes… [13:16:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47385 and previous config saved to /var/cache/conftool/dbconfig/20230503-131656-ladsgroup.json [13:17:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:17:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:17:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:17:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:17:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47386 and previous config saved to /var/cache/conftool/dbconfig/20230503-131736-ladsgroup.json [13:18:14] Lucas_WMDE: I'm rebooting kubernetes nodes in codfw, that's probably why [13:18:16] (03CR) 10Hashar: "PPC https://puppet-compiler.wmflabs.org/output/914731/1780/" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [13:18:20] ah ok [13:18:48] (03CR) 10Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [13:18:51] it went faster with the non-canary apply at 
least [13:18:54] (03PS5) 10Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) [13:19:04] e.g. “Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 26s)” [13:19:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [13:19:52] The scheduler probably sent a canary pod to a node that didn't have the image yet [13:20:11] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [13:20:13] oh right, and then it takes a while to download [13:20:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47387 and previous config saved to /var/cache/conftool/dbconfig/20230503-132022-ladsgroup.json [13:20:25] (03CR) 10Michael Große: [C: 03+1] "looks reasonable, but probably depends on Ib07b2acdf, right?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [13:20:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:20:52] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:912309|testwikidatawiki: enable entity labels in parsed API edit summaries (T335098)]] (duration: 17m 55s) [13:20:54] T335098: Testwikidata: enable entity labels in parsed edit summaries in API requests - https://phabricator.wikimedia.org/T335098 [13:20:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) (owner: 10Lucas Werkmeister (WMDE)) [13:21:14] Yeah, the mediawiki image is a tad on the heavy side [13:21:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:45] (03CR) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [13:22:54] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, 
AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:23:03] XioNoX: Can that BGP alert be because of the reboots? [13:23:10] (03PS2) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) [13:23:13] (and in that case should I downtime it for the duration) [13:23:18] (03CR) 10Herron: [C: 03+1] prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - 10https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [13:23:21] s/can/should/ [13:23:29] claime: yep, looks like it [13:23:42] (in meeting) [13:23:46] ack [13:23:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47388 and previous config saved to /var/cache/conftool/dbconfig/20230503-132349-ladsgroup.json [13:23:57] (03CR) 10Michael Große: [C: 03+1] Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [13:23:59] (03Merged) 10jenkins-bot: wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) (owner: 10Lucas Werkmeister (WMDE)) [13:24:12] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:24:15] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:24:20] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:30] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914437|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] [13:24:35] T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962 [13:24:35] T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460 [13:24:45] (03CR) 10Lucas Werkmeister (WMDE): "Should be testable here: https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [13:26:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:37] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet [13:26:50] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-webproxy: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914416 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:27:33] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/914777/41013/" [puppet] - 10https://gerrit.wikimedia.org/r/914777 (owner: 10Jaime Nuche) [13:30:00] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-wikireplica-dns: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914417 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:30:02] build-and-push-container-images is taking its time again [13:30:34] 6 
patches per deployment window feels pretty optimistic these days [13:31:05] (maybe it’ll get faster again once we only deploy to k8s? fingers crossed) [13:31:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47389 and previous config saved to /var/cache/conftool/dbconfig/20230503-133117-ladsgroup.json [13:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47390 and previous config saved to /var/cache/conftool/dbconfig/20230503-133232-ladsgroup.json [13:33:45] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-enc-cli: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914418 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:33:52] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [13:34:05] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM idm-test1001.wikimedia.org [13:34:27] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [13:34:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [13:35:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P47391 and previous config saved to /var/cache/conftool/dbconfig/20230503-133528-ladsgroup.json [13:35:46] !log herron@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reboot-single (exit_code=1) for host kafkamon1003.eqiad.wmnet [13:35:53] (03CR) 10Andrew Bogott: [C: 03+2] wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:36:03] !log slyngshede@cumin1001 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM idm-test1001.wikimedia.org [13:36:11] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org [13:37:04] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service,burrow-logging-eqiad.service,burrow-main-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:48] RECOVERY - Host db2184 is UP: PING OK - Packet loss = 0%, RTA = 35.61 ms [13:38:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P47392 and previous config saved to /var/cache/conftool/dbconfig/20230503-133855-ladsgroup.json [13:39:09] (03CR) 10Muehlenhoff: [C: 03+2] Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [13:39:11] 10SRE-tools, 10Infrastructure-Foundations: cookbooks.sre.ganeti.reimage: failure reported when first puppet run succeeds after a retry - https://phabricator.wikimedia.org/T335863 (10herron) [13:39:15] subbu: fyi I’m planning to skip ahead to your config change once the current scap is done, so you don’t have to wait through the rest of the Wikidata changes [13:39:24] (I assume we’ll overrun the window) [13:39:40] oh dear, no, there’s LVS maintenance right after it /o\ [13:39:58] * Lucas_WMDE is used to the luxury of several free hours after the UTC afternoon backport window :D [13:40:04] !log slyngshede@cumin1001 
END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org [13:40:13] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org [13:40:15] sounds good! [13:40:15] then the rest of the Wikidata changes might just have to wait a few hours longer [13:40:37] (03CR) 10Klausman: [C: 03+1] Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [13:41:59] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:914437|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:42:04] T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962 [13:42:04] T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460 [13:42:10] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:42:10] testing [13:42:59] https://www.wikidata.org/w/api.php?action=query&format=json&list=wblistentityusage&formatversion=2&wbeuentities=Q1 / https://www.wikidata.org/w/api.php?action=query&format=json&list=wblistentityusage&formatversion=2&wbleuentities=Q1 looks good on mwdebug, syncing [13:43:21] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org [13:43:24] (03CR) 10Klausman: [C: 03+1] Add conftool and service config for k8s-ingress-ml-staging 
[puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [13:43:38] (03CR) 10Klausman: [C: 03+1] conftool-data: add config for the k8s ingress for ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [13:43:58] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host kafkamon2003.codfw.wmnet [13:43:59] !log herron@cumin1001 START - Cookbook sre.dns.netbox [13:44:40] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:13] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (10JMeybohm) a:05RLazarus→03JMeybohm [13:45:59] 10SRE, 10envoy, 10serviceops: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (10JMeybohm) a:05RLazarus→03None [13:46:14] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:46:45] brb [13:46:53] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kafkamon2003.codfw.wmnet - herron@cumin1001" [13:47:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47393 and previous config saved to /var/cache/conftool/dbconfig/20230503-134740-ladsgroup.json [13:47:54] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM kafkamon2003.codfw.wmnet - herron@cumin1001" [13:47:54] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:54] !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache kafkamon2003.codfw.wmnet on all recursors [13:47:57] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafkamon2003.codfw.wmnet on all recursors [13:48:58] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-codfw [13:49:16] jouncebot: nowandnext [13:49:16] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1300) [13:49:16] In 0 hour(s) and 10 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400) [13:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P47394 and previous config saved to /var/cache/conftool/dbconfig/20230503-135034-ladsgroup.json [13:51:08] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:13] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-spreadcheck: use clouds.yaml section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914463 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:51:27] (03PS1) 10Herron: kafkamon: add kafkamon[12]003 to fw allow list [puppet] - 10https://gerrit.wikimedia.org/r/914787 (https://phabricator.wikimedia.org/T335424) [13:51:48] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd: convert to using mwopenstackclients and --os-cloud [puppet] - 10https://gerrit.wikimedia.org/r/914464 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:52:25] !log 
lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914437|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] (duration: 27m 54s) [13:52:29] T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962 [13:52:30] T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460 [13:52:39] (03PS5) 10Lucas Werkmeister (WMDE): Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. Scott Ananian) [13:52:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. Scott Ananian) [13:52:51] yippee [13:53:14] (03PS2) 10Hashar: Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 [13:53:27] (03Merged) 10jenkins-bot: Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. 
Scott Ananian) [13:53:31] (03CR) 10Hashar: Checkout tested patch in a branch (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar) [13:53:42] let’s hope it finishes before the window ends… [13:53:53] re [13:53:57] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:910556|Turn on experimental Parsoid Read Views support, except on commons & wikidata (T335157)]] [13:54:00] T335157: Experimentally enable Parsoid Read Views pages on query string - https://phabricator.wikimedia.org/T335157 [13:54:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P47395 and previous config saved to /var/cache/conftool/dbconfig/20230503-135402-ladsgroup.json [13:54:21] I can wait for the next one, in fact I have to, so all good :) [13:55:01] ok phew ^^ [13:55:14] then I’ll probably do some more config changes after this one if that’s okay [13:55:25] *and backports actually [13:55:25] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and cscott: Backport for [[gerrit:910556|Turn on experimental Parsoid Read Views support, except on commons & wikidata (T335157)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:55:39] subbu: can you test the change? [13:55:55] Lucas_WMDE: please ping me when you are done, thanks [13:55:56] yes, i can .. is it on the servers? [13:56:06] should be on the mwdebug servers [13:56:09] sukhe: will do [13:56:09] ok. 
[13:56:53] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - 10https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [13:58:19] * Lucas_WMDE tries to remember what the next backport would be anyways [13:58:38] backport wbsubscribers fix to wmf branches, then do the config change for it on Test Wikidata, right MichaelG_WMDE? [13:59:01] yes, I think so [13:59:07] (03CR) 10Ayounsi: templates: add 20.172.in-addr.arpa (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [13:59:12] ok thanks [13:59:30] Lucas_WMDE, this is on eqiad right, not codfw? [13:59:43] `scap backport` says it synced to both [13:59:49] mwdebug1001, 1002, 2001, 2002 [13:59:50] ok .. [13:59:58] (03PS1) 10Stevemunene: Add analytics_product admin group for airflow [puppet] - 10https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) [14:00:05] sukhe: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for LVS maintenance . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400). 
[14:00:51] (03Restored) 10Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [14:00:55] (03CR) 10Ayounsi: cloudlb: fix BGP IP address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [14:01:09] (03PS2) 10Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [14:01:31] I’ll +2 the backport already, it’ll take a while anyways [14:01:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit already" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [14:02:22] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kafkamon2003.codfw.wmnet - herron@cumin1001" [14:02:26] it is alright to move forward with the backport despite the LVS maintenance? [14:02:27] hmm .. it doesn't seem to be having any effect at all. [14:02:39] 10ops-codfw, 10DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Jhancock.wm) @Marostegui I can do this today. I tried earlier to reboot the idrac the unobtrusive way, holding the i button until the fans spin up, but it hasn't worked. The next step is to drain the flea power so we will nee... 
[14:02:39] * MichaelG_WMDE has no idea what "LVS maintenance" actually is [14:02:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47396 and previous config saved to /var/cache/conftool/dbconfig/20230503-140246-ladsgroup.json [14:02:57] Lucas_WMDE, you can sync everywhere, and we can debug after to see what is going on. [14:03:19] maybe i am missing something here that Scott probably knows. [14:03:26] right now, it looks like a no-op. [14:03:27] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kafkamon2003.codfw.wmnet - herron@cumin1001" [14:03:27] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kafkamon2003.codfw.wmnet [14:03:49] subbu: ok thanks [14:04:00] yeah I didn’t see anything either but it’s not like I knew a lot about what to look for ^^ [14:04:08] 10ops-codfw, 10DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) @Jhancock.wm I will switchover this host tomorrow from its current master role - so it will be ready for you to power it down whenever you need. I will write here once it is all fine for you to power it off. [14:04:11] MichaelG_WMDE: we are decommissioning and provisioning an LVS server and can't do it when deploys are happening: T334703 [14:04:11] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [14:04:13] (I just picked a random page via the API and loaded it with the URL parameter from the commit message) [14:04:18] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [14:04:31] it works just fine on my local MediaWiki install .. 
[14:04:47] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [14:05:01] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [14:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47398 and previous config saved to /var/cache/conftool/dbconfig/20230503-140540-ladsgroup.json [14:07:10] sukhe: just to confirm, I have some more time for deploying, as long as I ping you at the end, correct? [14:07:22] or did you want to do the LVS thing now after all and I misunderstood? [14:07:37] (I’d like to get my backports out of the way but they can wait if needed) [14:07:37] Lucas_WMDE: we have to do the LVS thing now because dc-ops will be on site soon :) [14:07:43] hm [14:07:49] then I misunderstood your message earlier, sorry [14:08:01] no, that's on me too. I meant that you can finish the existing and last one safely [14:08:03] then I’ll ping you as soon as this scap is done and pause there [14:08:07] ok , I see [14:08:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Andrew) @papaul, note that these hosts are still pending some trial work in codfw1dev so you shouldn't spend any effort on these ho... 
[14:08:44] ok now I understand, “I have to” only meant “have to wait until the scap is done because otherwise all hell breaks loose” :'D [14:09:00] haha [14:09:02] sadly [14:09:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47399 and previous config saved to /var/cache/conftool/dbconfig/20230503-140908-ladsgroup.json [14:09:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:09:23] we have plans to fix this and should but yeah, that's more longterm than provisioning the hosts [14:09:25] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:910556|Turn on experimental Parsoid Read Views support, except on commons & wikidata (T335157)]] (duration: 15m 27s) [14:09:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:09:28] T335157: Experimentally enable Parsoid Read Views pages on query string - https://phabricator.wikimedia.org/T335157 [14:09:31] sukhe: I’m done for now, go ahead [14:09:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47400 and previous config saved to /var/cache/conftool/dbconfig/20230503-140932-ladsgroup.json [14:09:36] Lucas_WMDE: thank you! 
[14:09:57] (03PS1) 10David Caro: d/changelog: prepare for release 0.95 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914790 [14:09:58] !log stop pybal on lvs2007 to drain host for decommissioning [14:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:10:29] oh but that means I should retract my +2 because that backport won’t merge so quickly now after all [14:10:39] (03PS4) 10MdsShakil: Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) [14:10:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "nope, this needs to wait until after the LVS window" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [14:11:12] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [14:11:25] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [14:11:35] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [14:11:38] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [14:11:46] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [14:11:56] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [14:12:08] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [14:12:23] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [14:12:28] !log otto@deploy1002 helmfile [eqiad] START 
helmfile.d/services/eventgate-analytics-external: apply [14:12:50] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:56] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [14:13:08] PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:13:15] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [14:13:20] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [14:13:30] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [14:13:31] (03PS1) 10Jbond: apt: drop files from the puppet source [puppet] - 10https://gerrit.wikimedia.org/r/914791 [14:13:33] (03CR) 10MdsShakil: Create autopatroller and patroller groups on bn.wikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: 10MdsShakil) [14:13:58] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [14:14:01] Lucas_WMDE, let me know once the config change is everywhere. thanks! 
[14:14:06] (03PS5) 10MdsShakil: Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) [14:14:13] subbu: it should be everywhere by now [14:14:26] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [14:14:26] ty [14:14:26] (I’m done deploying now) [14:14:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41014/console" [puppet] - 10https://gerrit.wikimedia.org/r/914791 (owner: 10Jbond) [14:14:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T335838)', diff saved to https://phabricator.wikimedia.org/P47401 and previous config saved to /var/cache/conftool/dbconfig/20230503-141458-ladsgroup.json [14:14:59] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [14:15:02] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:15:11] ^ expected [14:15:16] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [14:15:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] apt: drop files from the puppet source [puppet] - 10https://gerrit.wikimedia.org/r/914791 (owner: 10Jbond) [14:15:37] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [14:15:43] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [14:15:49] (03CR) 10Superpes15: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: 10MdsShakil) [14:16:01] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [14:16:43] (03CR) 10Superpes15: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 
(https://phabricator.wikimedia.org/T335829) (owner: 10MdsShakil) [14:16:45] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 [14:16:48] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [14:17:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47402 and previous config saved to /var/cache/conftool/dbconfig/20230503-141752-ladsgroup.json [14:17:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [14:18:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [14:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T335838)', diff saved to https://phabricator.wikimedia.org/P47403 and previous config saved to /var/cache/conftool/dbconfig/20230503-141817-ladsgroup.json [14:24:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T335838)', diff saved to https://phabricator.wikimedia.org/P47404 and previous config saved to /var/cache/conftool/dbconfig/20230503-142427-ladsgroup.json [14:25:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:26:07] (03PS2) 10EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 [14:26:54] !log Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in kafka main clusters - T334733 [14:26:56] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:57] T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 [14:28:39] (03CR) 10Muehlenhoff: [C: 03+2] Revert "sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage" [cookbooks] - 10https://gerrit.wikimedia.org/r/912311 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:29:13] 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Ottomata) Done for Kafka main. We should do this for Kafka logging as well, so that when... [14:29:20] PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P47405 and previous config saved to /var/cache/conftool/dbconfig/20230503-143005-ladsgroup.json [14:31:42] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Aklapper) Hi @lojo, welcome to Wikimedia Phabricator! This Phabricator account does not use a `@wikimedia.de` email address, and currently there is no WMDE mediawiki.org SUL account [associated](htt... 
[14:33:17] !log set routing-options static route 208.80.153.224/28 next-hop 10.192.49.7 [move static route for high-traffic1 to lvs2010]: T335777 [14:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:20] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [14:33:24] (03CR) 10Elukey: [C: 03+2] modules: duplicate the istio ingress template for 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914306 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [14:33:34] (03PS5) 10Elukey: modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) [14:34:25] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/914317/41016/" [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [14:34:33] (03Abandoned) 10Elukey: ml-services: limit deployments of experimental to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900381 (owner: 10Elukey) [14:34:41] (03CR) 10Elukey: [C: 03+2] modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [14:36:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs2007.codfw.wmnet [14:36:39] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:37:48] (03PS1) 10Elukey: fast-api: update ingress.istio module version to 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914793 (https://phabricator.wikimedia.org/T335756) [14:37:50] (03CR) 10Ahmon Dancy: [C: 03+1] "The issue with the deploy server was due to me trying https://gerrit.wikimedia.org/r/c/operations/puppet/+/906051 where /srv/mediawiki alr" [puppet] - 10https://gerrit.wikimedia.org/r/914777 (owner: 10Jaime Nuche) 
[14:37:54] (03PS1) 10Majavah: kubernetes: Remove deprecated state from buildservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914794 [14:38:38] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt frav1003 - jclark@cumin1001" [14:39:03] (03PS1) 10Elukey: ml-services: enable the 'mlstaging' ingress flag for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914795 (https://phabricator.wikimedia.org/T335756) [14:39:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47406 and previous config saved to /var/cache/conftool/dbconfig/20230503-143933-ladsgroup.json [14:39:59] jouncebot: nowandnext [14:39:59] For the next 2 hour(s) and 20 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400) [14:40:00] In 2 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1700) [14:40:06] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:06] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org [14:40:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt frav1003 - jclark@cumin1001" [14:40:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:41:07] (03CR) 10Elukey: [C: 03+2] fast-api: update ingress.istio module version to 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914793 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [14:41:16] (03CR) 10Elukey: [C: 03+2] ml-services: enable the 'mlstaging' ingress flag for 
ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/914795 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [14:42:26] (03PS1) 10Eevans: restbase: upgrade Cassandra on restbase2012 & restbase1016 [puppet] - 10https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383) [14:42:59] !log Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in kafka logging clusters - T334733 [14:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:02] T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 [14:43:23] (03CR) 10Clément Goubert: [C: 03+2] beta: delete old Puppet client bucket files from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/914777 (owner: 10Jaime Nuche) [14:43:55] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org [14:45:02] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:45:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P47407 and previous config saved to /var/cache/conftool/dbconfig/20230503-144511-ladsgroup.json [14:46:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:46:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs2007.codfw.wmnet [14:46:24] 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2007.codfw.wmnet` - lvs2007.codfw.wmnet (**WARN**) - Downtimed host on Ici... 
[14:46:29] 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Ottomata) Done for logging clusters, and we all done! [14:48:39] (03CR) 10Eevans: [C: 03+2] restbase: upgrade Cassandra on restbase2012 & restbase1016 [puppet] - 10https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [14:49:29] (03CR) 10Ssingh: [C: 03+2] lvs2007: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:50:13] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:50:26] (03CR) 10Hnowlan: restbase: upgrade Cassandra on restbase2012 & restbase1016 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [14:51:34] 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [14:52:55] !log homer "cr*-codfw*" commit "Gerrit: 914344 remove decommissioned host lvs2007": T335777 [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:58] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [14:53:10] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [14:53:13] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [14:54:08] (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914794 (owner: 
10Majavah) [14:54:12] !log [finished] homer "cr*-codfw*" commit "Gerrit: 914344 remove decommissioned host lvs2007": T335777 [14:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:18] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:33] (03CR) 10David Caro: [C: 03+2] kubernetes: Remove deprecated state from buildservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914794 (owner: 10Majavah) [14:54:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47408 and previous config saved to /var/cache/conftool/dbconfig/20230503-145440-ladsgroup.json [14:55:18] (03Merged) 10jenkins-bot: kubernetes: Remove deprecated state from buildservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914794 (owner: 10Majavah) [14:57:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:57:48] (03CR) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [14:57:50] (03PS2) 10David Caro: d/changelog: prepare for release 0.95 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914790 [14:58:12] (03PS1) 10Hokwelum: Increase number of retries for html download [puppet] - 10https://gerrit.wikimedia.org/r/914800 [14:59:04] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 (10wiki_willy) 
a:05wiki_willy→03Jclark-ctr [14:59:36] !log fix backup route for high-traffic2 in codfw: set routing-options static route 208.80.153.240/28 next-hop 10.192.17.7 [14:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:49] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10Jclark-ctr) a:03Jclark-ctr [14:59:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [15:00:09] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335775 (10Jclark-ctr) a:03Jclark-ctr [15:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T335838)', diff saved to https://phabricator.wikimedia.org/P47409 and previous config saved to /var/cache/conftool/dbconfig/20230503-150017-ladsgroup.json [15:00:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:00:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:00:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47410 and previous config saved to /var/cache/conftool/dbconfig/20230503-150042-ladsgroup.json [15:01:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47411 and previous config saved to /var/cache/conftool/dbconfig/20230503-150103-ladsgroup.json [15:01:13] (03PS2) 10Hokwelum: Increase number of retries for html download [puppet] - 10https://gerrit.wikimedia.org/r/914800 (https://phabricator.wikimedia.org/T335761) [15:02:48] PROBLEM - Check systemd state on db2184 is 
CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:03:27] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [15:03:44] (03PS3) 10Majavah: d/changelog: prepare for release 0.95 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914790 (owner: 10David Caro) [15:03:54] (03CR) 10Majavah: [C: 03+1] d/changelog: prepare for release 0.95 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914790 (owner: 10David Caro) [15:07:00] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47412 and previous config saved to /var/cache/conftool/dbconfig/20230503-150702-ladsgroup.json [15:08:10] (03CR) 10David Caro: [C: 03+2] d/changelog: prepare for release 0.95 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914790 (owner: 10David Caro) [15:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47413 and previous config saved to /var/cache/conftool/dbconfig/20230503-150947-ladsgroup.json [15:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T335838)', diff saved to https://phabricator.wikimedia.org/P47414 and previous config saved to /var/cache/conftool/dbconfig/20230503-150947-ladsgroup.json [15:09:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 
on db2178.codfw.wmnet with reason: Maintenance [15:09:59] (03Merged) 10jenkins-bot: d/changelog: prepare for release 0.95 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914790 (owner: 10David Caro) [15:10:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:10:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T335838)', diff saved to https://phabricator.wikimedia.org/P47415 and previous config saved to /var/cache/conftool/dbconfig/20230503-151013-ladsgroup.json [15:16:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T335838)', diff saved to https://phabricator.wikimedia.org/P47416 and previous config saved to /var/cache/conftool/dbconfig/20230503-151627-ladsgroup.json [15:17:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:17:18] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [15:17:18] (03CR) 10Jbond: [C: 03+2] tox: do not skip missing interpreters on CI [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914746 (owner: 10Hashar) [15:17:20] (03CR) 10Jbond: [C: 03+2] tox: use default python for local testing [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914747 (owner: 10Hashar) [15:17:23] (03CR) 10Jbond: [C: 03+2] Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar) [15:18:37] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review: Deal with archival of Stretch on Debian mirrors - https://phabricator.wikimedia.org/T335282 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is resolved: * apt sources on remaining stretch servers stopped u... 
[15:19:34] (03Merged) 10jenkins-bot: tox: do not skip missing interpreters on CI [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914746 (owner: 10Hashar) [15:19:36] (03Merged) 10jenkins-bot: tox: use default python for local testing [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914747 (owner: 10Hashar) [15:19:39] (03Merged) 10jenkins-bot: Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar) [15:21:34] (03CR) 10Muehlenhoff: "FYI, new access groups need discussion/approval in the weekly SRE Infrastructure Foundations meeting (happening next Monday) so this will " [puppet] - 10https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [15:22:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47417 and previous config saved to /var/cache/conftool/dbconfig/20230503-152208-ladsgroup.json [15:22:15] (03PS1) 10Elukey: fastapi: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/914849 [15:23:56] (03PS1) 10Eevans: restbase: upgrade cluster to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/914851 (https://phabricator.wikimedia.org/T335383) [15:24:23] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:24:26] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [15:24:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P47418 and previous config saved to /var/cache/conftool/dbconfig/20230503-152453-ladsgroup.json [15:25:23] (03CR) 10Elukey: [C: 03+2] fastapi: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/914849 (owner: 10Elukey) 
[15:28:09] (03PS1) 10Jbond: 2.5.6: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914853 [15:28:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] 2.5.6: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914853 (owner: 10Jbond) [15:29:16] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:30:16] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/914854 [15:30:25] 10SRE-tools, 10Infrastructure-Foundations: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10MoritzMuehlenhoff) [15:31:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47419 and previous config saved to /var/cache/conftool/dbconfig/20230503-153133-ladsgroup.json [15:31:53] (03CR) 10JHathaway: puppet: use a string rather than a symbol to call a puppet function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914406 (owner: 10JHathaway) [15:32:04] (03CR) 10JHathaway: [C: 03+2] puppet: use a string rather than a symbol to call a puppet function [puppet] - 10https://gerrit.wikimedia.org/r/914406 (owner: 10JHathaway) [15:32:28] (03CR) 10JHathaway: [C: 03+2] puppet7: re-add host core [puppet] - 10https://gerrit.wikimedia.org/r/914408 (owner: 10JHathaway) [15:32:53] (03PS1) 10Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) [15:32:54] (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:32:57] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 
10https://gerrit.wikimedia.org/r/914854 (owner: 10Jbond) [15:33:30] (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [15:34:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:34:39] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [15:34:51] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:35:18] (03PS2) 10Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) [15:36:57] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for lvs2011 - pt1979@cumin2002" [15:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47420 and previous config saved to /var/cache/conftool/dbconfig/20230503-153715-ladsgroup.json [15:37:32] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster1002.eqiad.wmnet [15:37:54] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts puppetmaster1002.eqiad.wmnet [15:38:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for lvs2011 - pt1979@cumin2002" [15:38:06] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [15:38:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [15:38:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [15:39:48] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Jhancock.wm) @Papaul Dell kicked back my updated dispatch request for not enough troubleshooting. Since the server was down, I swapped DIMM A6 with DIMM A5 about two hours ago and the server ha... [15:40:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P47421 and previous config saved to /var/cache/conftool/dbconfig/20230503-154000-ladsgroup.json [15:40:54] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [15:40:56] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [15:40:59] (03PS3) 10Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) [15:40:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs2011.mgmt.codfw.wmnet with reboot policy FORCED [15:41:18] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [15:41:59] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) [15:42:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [15:42:51] (03CR) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [15:42:53] !log 
akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [15:42:59] (03PS1) 10Elukey: admin_ng: add ml-staging among helmfile_namespace_certs's options [deployment-charts] - 10https://gerrit.wikimedia.org/r/914859 (https://phabricator.wikimedia.org/T335756) [15:43:01] (03CR) 10Eevans: [C: 03+2] restbase: upgrade cluster to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/914851 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [15:43:05] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Marostegui) How can Dell kick back this request when their systems logs say: `Multi-bit memory errors are detected on the memory device at location(s) DIMM_A6. Immediately replace the DIMM.` - t... [15:45:34] (HelmReleaseBadStatus) resolved: Helm release machinetranslation/staging on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:46:38] (03PS1) 10Jbond: sre.hardward.upgrade-firmware: Ensure we only apply version check to gen 14 [cookbooks] - 10https://gerrit.wikimedia.org/r/914860 [15:46:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47422 and previous config saved to /var/cache/conftool/dbconfig/20230503-154639-ladsgroup.json [15:48:45] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:48:47] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [15:51:06] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [15:52:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47423 and previous config saved to /var/cache/conftool/dbconfig/20230503-155221-ladsgroup.json [15:55:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47424 and previous config saved to /var/cache/conftool/dbconfig/20230503-155506-ladsgroup.json [15:55:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:55:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:56:00] jouncebot: nowandnext [15:56:00] For the next 1 hour(s) and 3 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400) [15:56:00] In 1 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1700) [15:59:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:59:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:59:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T335838)', diff saved to https://phabricator.wikimedia.org/P47425 and previous config saved to /var/cache/conftool/dbconfig/20230503-155946-ladsgroup.json [16:00:35] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10jcrespo) 2 in a row, for hw errors captured in their own hw logs? 
T335396#8821456 Will we have to send our lawyers so they honor their contract obligations? [16:00:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47426 and previous config saved to /var/cache/conftool/dbconfig/20230503-160039-ladsgroup.json [16:00:46] (03CR) 10Klausman: [C: 03+1] admin_ng: add ml-staging among helmfile_namespace_certs's options [deployment-charts] - 10https://gerrit.wikimedia.org/r/914859 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [16:01:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T335838)', diff saved to https://phabricator.wikimedia.org/P47427 and previous config saved to /var/cache/conftool/dbconfig/20230503-160146-ladsgroup.json [16:03:07] (03PS1) 10Jdlrobson: Enable graphs on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) [16:03:23] (03CR) 10Herron: [C: 03+2] kafkamon: add kafkamon[12]003 to fw allow list [puppet] - 10https://gerrit.wikimedia.org/r/914787 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [16:05:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [16:05:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [16:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T335838)', diff saved to https://phabricator.wikimedia.org/P47428 and previous config saved to /var/cache/conftool/dbconfig/20230503-160601-ladsgroup.json [16:06:41] (03PS3) 10Hokwelum: Increase number of retries for html dumps download [puppet] - 10https://gerrit.wikimedia.org/r/914800 (https://phabricator.wikimedia.org/T335761) [16:07:12] hola marostegui - got a minute for a quick DB question? 
[16:07:42] (03Abandoned) 10Jbond: sre.hardward.upgrade-firmware: Ensure we only apply version check to gen 14 [cookbooks] - 10https://gerrit.wikimedia.org/r/914860 (owner: 10Jbond) [16:08:07] (03CR) 10ArielGlenn: [C: 03+2] Increase number of retries for html dumps download [puppet] - 10https://gerrit.wikimedia.org/r/914800 (https://phabricator.wikimedia.org/T335761) (owner: 10Hokwelum) [16:08:44] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster2001.codfw.wmnet [16:08:57] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetmaster2001.codfw.wmnet [16:11:32] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) [16:12:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T335838)', diff saved to https://phabricator.wikimedia.org/P47429 and previous config saved to /var/cache/conftool/dbconfig/20230503-161235-ladsgroup.json [16:13:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2011.mgmt.codfw.wmnet with reboot policy FORCED [16:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T335838)', diff saved to https://phabricator.wikimedia.org/P47430 and previous config saved to /var/cache/conftool/dbconfig/20230503-161402-ladsgroup.json [16:14:24] (03PS1) 10Jbond: sre.hardward.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 [16:15:22] RECOVERY - Check systemd state on db2184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:35] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) [16:15:45] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P47431 and previous config saved to /var/cache/conftool/dbconfig/20230503-161545-ladsgroup.json [16:17:26] (03CR) 10CI reject: [V: 04-1] sre.hardward.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond) [16:18:32] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011'] [16:18:38] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/914772/41019/" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [16:19:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2011'] [16:19:38] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011'] [16:19:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2011'] [16:20:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011'] [16:20:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2011'] [16:20:58] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) [16:23:40] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:24] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:33] !log pt1979@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011'] [16:27:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P47432 and previous config saved to /var/cache/conftool/dbconfig/20230503-162741-ladsgroup.json [16:28:22] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [16:29:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P47433 and previous config saved to /var/cache/conftool/dbconfig/20230503-162908-ladsgroup.json [16:30:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P47434 and previous config saved to /var/cache/conftool/dbconfig/20230503-163051-ladsgroup.json [16:31:32] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2011'] [16:35:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:52] PROBLEM - Check systemd state on ml-serve1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:18] PROBLEM - dump of backup1-codfw in codfw on backupmon1001 is CRITICAL: Last dump for backup1-codfw at codfw (db2184) taken on 2023-05-03 16:20:01 is 17 GiB, but the previous one was 15 GiB, a change of +16.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:42:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P47435 and previous config saved to /var/cache/conftool/dbconfig/20230503-164248-ladsgroup.json [16:43:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:43:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P47436 and previous config saved to /var/cache/conftool/dbconfig/20230503-164414-ladsgroup.json [16:45:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47437 and previous config saved to /var/cache/conftool/dbconfig/20230503-164557-ladsgroup.json [16:46:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2180.codfw.wmnet with reason: Maintenance [16:46:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:46:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T335838)', diff saved to https://phabricator.wikimedia.org/P47438 and previous config saved to /var/cache/conftool/dbconfig/20230503-164622-ladsgroup.json [16:46:37] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:46:41] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:47:14] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:47:17] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:47:41] (03PS1) 10Urbanecm: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914836 (https://phabricator.wikimedia.org/T334630) [16:47:54] (03PS1) 10Urbanecm: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) [16:52:37] (03PS1) 10Ssingh: sites.yaml: add new LVS host lvs2011 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/914871 (https://phabricator.wikimedia.org/T326767) [16:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T335838)', diff saved to https://phabricator.wikimedia.org/P47440 and previous config saved to /var/cache/conftool/dbconfig/20230503-165754-ladsgroup.json [16:58:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: 
Maintenance [16:58:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T335838)', diff saved to https://phabricator.wikimedia.org/P47441 and previous config saved to /var/cache/conftool/dbconfig/20230503-165811-ladsgroup.json [16:58:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [16:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47442 and previous config saved to /var/cache/conftool/dbconfig/20230503-165818-ladsgroup.json [16:58:44] !log herron@cumin1001 START - Cookbook sre.ganeti.reimage for host kafkamon2003.codfw.wmnet with OS bullseye [16:58:51] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by herron@cumin1001 for host kafkamon2003.codfw.wmnet with OS bullseye [16:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T335838)', diff saved to https://phabricator.wikimedia.org/P47443 and previous config saved to /var/cache/conftool/dbconfig/20230503-165920-ladsgroup.json [16:59:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:59:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T335838)', diff saved to https://phabricator.wikimedia.org/P47444 and previous config saved to /var/cache/conftool/dbconfig/20230503-165954-ladsgroup.json [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1700) [17:00:41] ^ please note that 
there is a scap lock in progress, as we are still provisioning the lvs host in codfw [17:00:50] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:01:04] if there is any deployment for this slot, please let me know and I will lift it and stop the work (and not resume it) [17:01:56] (03PS2) 10Jdlrobson: Enable graphs on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) [17:02:14] (03PS1) 10Ottomata: flink-app - quote all flinkConfiguration values [deployment-charts] - 10https://gerrit.wikimedia.org/r/914874 [17:02:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [17:03:09] (03CR) 10Ottomata: [C: 03+2] flink-app - quote all flinkConfiguration values [deployment-charts] - 10https://gerrit.wikimedia.org/r/914874 (owner: 10Ottomata) [17:03:40] (03CR) 10Majavah: Enable graphs on test wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson) [17:05:44] lifting the lock as it's unlikely we will finish reimaging the next lvs host by then, including the "predictable interfaces" and all that :) [17:05:47] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 169m 01s) [17:05:50] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [17:05:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [17:06:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T335838)', diff saved to 
https://phabricator.wikimedia.org/P47445 and previous config saved to /var/cache/conftool/dbconfig/20230503-170607-ladsgroup.json [17:07:19] (03CR) 10Jelto: [C: 03+1] "lgtm for easier migration and switchovers." [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [17:07:31] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:07:48] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [17:08:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T335838)', diff saved to https://phabricator.wikimedia.org/P47446 and previous config saved to /var/cache/conftool/dbconfig/20230503-170821-ladsgroup.json [17:13:17] (03CR) 10Dzahn: [C: 03+2] add discovery records for miscweb in eqiad and miscweb in codfw [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn) [17:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P47447 and previous config saved to /var/cache/conftool/dbconfig/20230503-171317-ladsgroup.json [17:13:20] (03PS3) 10Dzahn: add discovery records for miscweb in eqiad and miscweb in codfw [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797) [17:15:28] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafkamon2003.codfw.wmnet with reason: host reimage [17:18:38] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafkamon2003.codfw.wmnet with reason: host reimage [17:21:08] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:21:14] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P47448 and previous config saved to /var/cache/conftool/dbconfig/20230503-172114-ladsgroup.json [17:22:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:22:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2011.codfw.wmnet with OS bullseye [17:22:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye completed... [17:23:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [17:23:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P47449 and previous config saved to /var/cache/conftool/dbconfig/20230503-172328-ladsgroup.json [17:23:48] (03PS16) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [17:28:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P47450 and previous config saved to /var/cache/conftool/dbconfig/20230503-172824-ladsgroup.json [17:31:40] (03PS17) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [17:32:34] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafkamon2003.codfw.wmnet 
with OS bullseye [17:32:39] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by herron@cumin1001 for host kafkamon2003.codfw.wmnet with OS bullseye completed: - kafkamon2003 (**PASS**) - Remov... [17:36:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P47451 and previous config saved to /var/cache/conftool/dbconfig/20230503-173620-ladsgroup.json [17:38:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P47452 and previous config saved to /var/cache/conftool/dbconfig/20230503-173834-ladsgroup.json [17:40:51] (03PS1) 10Herron: kafkamon: cut over to bullseye exporters [puppet] - 10https://gerrit.wikimedia.org/r/914876 (https://phabricator.wikimedia.org/T335424) [17:41:11] RECOVERY - Check systemd state on wdqs2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:11] !log bking@cumin1001 reboot wdqs20[13-22].codfw.wmnet T335835 [17:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T335838)', diff saved to https://phabricator.wikimedia.org/P47453 and previous config saved to /var/cache/conftool/dbconfig/20230503-174330-ladsgroup.json [17:46:26] (03PS1) 10BCornwall: debian/rules: Add --buildsystem=pybuild [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877 [17:48:35] (03PS1) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 [17:49:27] (03PS2) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 [17:50:01] (03PS3) 
10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 [17:50:43] (03PS2) 10BCornwall: debian/rules: Add --buildsystem=pybuild [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877 [17:51:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877 (owner: 10BCornwall) [17:51:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47454 and previous config saved to /var/cache/conftool/dbconfig/20230503-175126-ladsgroup.json [17:51:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:51:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:52:12] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:52:26] (03CR) 10BCornwall: [C: 03+2] debian/rules: Add --buildsystem=pybuild [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877 (owner: 10BCornwall) [17:53:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T335838)', diff saved to https://phabricator.wikimedia.org/P47455 and previous config saved to /var/cache/conftool/dbconfig/20230503-175340-ladsgroup.json [17:53:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:53:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:54:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T335838)', diff saved to https://phabricator.wikimedia.org/P47456 and previous config saved to 
/var/cache/conftool/dbconfig/20230503-175404-ladsgroup.json [17:55:15] 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Papaul) @Jhancock.wm can you run the netbox offline script and get lvs2007 out of the rack and into storage ? Thanks [17:56:27] (03CR) 10Jdlrobson: Enable graphs on test wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson) [17:57:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [17:58:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [17:58:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47457 and previous config saved to /var/cache/conftool/dbconfig/20230503-175806-ladsgroup.json [18:00:06] brennen and jeena: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1800). [18:00:06] brennen and jeena: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1800). [18:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T335838)', diff saved to https://phabricator.wikimedia.org/P47458 and previous config saved to /var/cache/conftool/dbconfig/20230503-180018-ladsgroup.json [18:02:34] o/ [18:02:47] sukhe: safe to proceed w/train? 
[18:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47459 and previous config saved to /var/cache/conftool/dbconfig/20230503-180438-ladsgroup.json [18:05:18] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [18:07:14] (03PS4) 10Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) [18:07:19] brennen: yes please [18:07:20] sorry, just saw [18:07:35] no worries! we're not on a time crunch. [18:08:20] !log train 1.41.0-wmf.7 (T330213): logs quiet and no current blockers, rolling to group1 [18:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:23] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:08:50] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914879 (https://phabricator.wikimedia.org/T330213) [18:08:52] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914879 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot) [18:09:51] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914879 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot) [18:11:50] for what is worth, i'm currently unable to access Wikimedia sites (connection times out). [18:12:26] I can access enwiki urbanecm [18:12:32] Workin for me. [18:12:36] Have you tried different network [18:12:37] same [18:12:44] (working for me) [18:12:44] no visible issues [18:14:02] okay, might be a wiki-specific issue in the WMCZ's office connection, appears to work via mobile data. sorry for the false alarm then! [18:14:27] np! 
I am a bit on the edge because we have an LVS host down in codfw. in theory it should not be a problem but if it does, then I am ready to depool codfw :) [18:15:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P47460 and previous config saved to /var/cache/conftool/dbconfig/20230503-181524-ladsgroup.json [18:15:47] (03PS2) 10Majavah: hieradata: remove files for long-gone hosts [puppet] - 10https://gerrit.wikimedia.org/r/914268 [18:15:49] (03PS2) 10Majavah: O:wmcs::nfs: delete old primary role files [puppet] - 10https://gerrit.wikimedia.org/r/914269 [18:15:51] (03PS2) 10Majavah: P::ldap::client::labs: drop support for production [puppet] - 10https://gerrit.wikimedia.org/r/914270 [18:15:53] (03PS2) 10Majavah: labstore: remove unused files [puppet] - 10https://gerrit.wikimedia.org/r/914272 [18:16:10] (03Abandoned) 10Majavah: O:wmcs::nfs: delete old test role [puppet] - 10https://gerrit.wikimedia.org/r/914271 (owner: 10Majavah) [18:16:38] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.7 refs T330213 [18:16:43] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:19:40] ftr, tracert ends at 195.2.20.74 / ae44-xcr1.att.cw.net, which seems to be within Vodafone's network. [18:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P47461 and previous config saved to /var/cache/conftool/dbconfig/20230503-181944-ladsgroup.json [18:20:03] (03PS2) 10Eevans: Add component/cassandra41 for Cassandra 4.1.x releases [puppet] - 10https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814) [18:20:34] sukhe: ^ [18:20:49] Maybe issues between Vodafone and wikimedia then [18:20:59] urbanecm: to drmrs or esams? [18:21:57] esams. seems to work again now though. [18:22:15] ok great! 
[18:22:22] * sukhe loves self-resolving issues [18:22:24] :) [18:22:26] me too! [18:22:57] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.7 refs T330213 (duration: 06m 18s) [18:23:01] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:24:39] Networks confuse me [18:24:59] Because they never break when connectivity ops are looking [18:26:30] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [18:26:33] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [18:26:48] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [18:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P47462 and previous config saved to /var/cache/conftool/dbconfig/20230503-183030-ladsgroup.json [18:33:20] (03PS1) 10Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/914881 [18:34:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P47463 and previous config saved to /var/cache/conftool/dbconfig/20230503-183451-ladsgroup.json [18:45:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T335838)', diff saved to https://phabricator.wikimedia.org/P47464 and previous config saved to /var/cache/conftool/dbconfig/20230503-184536-ladsgroup.json [18:45:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [18:46:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [18:46:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T335838)', diff saved to https://phabricator.wikimedia.org/P47465 and previous config saved to /var/cache/conftool/dbconfig/20230503-184610-ladsgroup.json [18:49:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47466 and previous config saved to /var/cache/conftool/dbconfig/20230503-184957-ladsgroup.json [18:50:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [18:50:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [18:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:50:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:50:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T335838)', diff saved to https://phabricator.wikimedia.org/P47467 and previous config saved to /var/cache/conftool/dbconfig/20230503-185026-ladsgroup.json [18:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T335838)', diff saved to https://phabricator.wikimedia.org/P47468 and previous config saved to /var/cache/conftool/dbconfig/20230503-185526-ladsgroup.json [18:56:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T335838)', diff saved to https://phabricator.wikimedia.org/P47469 and previous config saved to /var/cache/conftool/dbconfig/20230503-185654-ladsgroup.json [18:57:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:02:07] (03CR) 10Eevans: [C: 03+2] Add component/cassandra41 for Cassandra 4.1.x releases [puppet] - 10https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:09:07] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:10:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P47470 and previous config saved to /var/cache/conftool/dbconfig/20230503-191032-ladsgroup.json [19:12:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P47471 and previous config saved to /var/cache/conftool/dbconfig/20230503-191200-ladsgroup.json [19:19:26] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 [19:20:27] !log bking@cumin1001 reboot Elastic cluster for T335835 [19:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:01] (03PS3) 10Jdlrobson: Enable graphs on test wikipedia and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) [19:24:47] (03PS2) 10Jdlrobson: Explicitly enable MFCustomSiteModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) [19:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P47472 and previous config saved to 
/var/cache/conftool/dbconfig/20230503-192538-ladsgroup.json [19:27:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P47473 and previous config saved to /var/cache/conftool/dbconfig/20230503-192707-ladsgroup.json [19:29:55] (03CR) 10SBassett: [C: 03+1] "(from a security perspective)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson) [19:30:37] PROBLEM - Check systemd state on elastic2067 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:42] (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:31:09] PROBLEM - Check systemd state on elastic2068 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:15] PROBLEM - Check systemd state on elastic2085 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:03] (03CR) 10Btullis: [V: 03+1] jupyterhub-conda: Fix incompatibility with HDFS-FUSE mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [19:34:05] (03CR) 10Btullis: [V: 03+1 C: 03+2] jupyterhub-conda: Fix incompatibility with HDFS-FUSE mount [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [19:35:15] RECOVERY - Check systemd state on elastic2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:42] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:47] RECOVERY - Check systemd state on elastic2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:53] RECOVERY - Check systemd state on elastic2085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:22] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 [19:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T335838)', diff saved to https://phabricator.wikimedia.org/P47474 and previous config saved to /var/cache/conftool/dbconfig/20230503-194045-ladsgroup.json [19:40:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:41:03] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T335838)', diff saved to https://phabricator.wikimedia.org/P47475 and previous config saved to /var/cache/conftool/dbconfig/20230503-194213-ladsgroup.json [19:42:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:42:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T335838)', diff saved to https://phabricator.wikimedia.org/P47476 and previous config saved to /var/cache/conftool/dbconfig/20230503-194238-ladsgroup.json [19:43:46] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T335835 [19:49:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T335838)', diff saved to https://phabricator.wikimedia.org/P47477 and previous config saved to /var/cache/conftool/dbconfig/20230503-194905-ladsgroup.json [19:54:55] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:41] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:55] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units 
failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:12] (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:17] PROBLEM - Check systemd state on elastic2082 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:15] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:29] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T2000). [20:00:05] MdsShakil and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:22] Hello 🙋 [20:01:03] RECOVERY - Check systemd state on elastic2082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:09] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:12] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:27] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:37] present [20:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P47478 and previous config saved to /var/cache/conftool/dbconfig/20230503-200411-ladsgroup.json [20:05:07] (03PS2) 10RLazarus: Render SLO and SLI numbers as percentunit [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 [20:05:47] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:12] (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[20:07:03] hi - i can deploy [20:08:53] MdsShakil: i'll start with yours [20:09:19] (03CR) 10RLazarus: "Dashboard/slo-Linkrecommendation view: https://grafana.wikimedia.org/dashboard/snapshot/L337vYP1OAmYC0L2jWowYT6R58040t2I" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:09:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: 10MdsShakil) [20:10:33] (03Merged) 10jenkins-bot: Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: 10MdsShakil) [20:11:02] !log cjming@deploy1002 Started scap: Backport for [[gerrit:914428|Create autopatroller and patroller groups on bn.wikiquote (T335829)]] [20:11:06] T335829: Create autopatroller and patroller groups on bnwikiquote - https://phabricator.wikimedia.org/T335829 [20:11:58] (03CR) 10Clare Ming: [C: 03+2] Router handling code should be centralized into mmv.bootstrap [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914301 (https://phabricator.wikimedia.org/T236591) (owner: 10Jdlrobson) [20:12:07] (03CR) 10RLazarus: Render SLO and SLI numbers as percentunit (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:12:59] !log cjming@deploy1002 cjming and mdsshakil: Backport for [[gerrit:914428|Create autopatroller and patroller groups on bn.wikiquote (T335829)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:13:06] MdsShakil: can you test? 
[20:13:42] !log fab@deploy1002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [20:13:52] (03CR) 10Herron: [C: 03+1] "Nice one, simplifies the slo queries as well, sweet" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:14:01] !log fab@deploy1002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 19s) [20:14:02] cjming: look good to me [20:14:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [20:14:18] great - syncing [20:14:35] (03Merged) 10jenkins-bot: Router handling code should be centralized into mmv.bootstrap [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914301 (https://phabricator.wikimedia.org/T236591) (owner: 10Jdlrobson) [20:14:35] PROBLEM - Check systemd state on elastic2066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:39] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:51] (03CR) 10Herron: "oop I spoke too soon, will wait for followup PS" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:14:59] (03PS4) 10Clare Ming: Enable graphs on test wikipedia and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson) [20:16:09] RECOVERY - Check systemd state on elastic2066 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:11] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:12] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:47] (03CR) 10RLazarus: Render SLO and SLI numbers as percentunit (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:18:40] (03CR) 10Herron: [C: 03+1] Render SLO and SLI numbers as percentunit (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:19:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P47479 and previous config saved to /var/cache/conftool/dbconfig/20230503-201918-ladsgroup.json [20:19:39] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:914428|Create autopatroller and patroller groups on bn.wikiquote (T335829)]] (duration: 08m 36s) [20:19:41] T335829: Create autopatroller and patroller groups on bnwikiquote - https://phabricator.wikimedia.org/T335829 [20:19:41] MdsShakil: should be live! [20:19:57] Jdlrobson: starting your patches now [20:20:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson) [20:20:51] cjming: Thank you! 
[20:20:58] (03Merged) 10jenkins-bot: Enable graphs on test wikipedia and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson) [20:21:12] (SystemdUnitFailed) resolved: (8) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2046:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:48] !log cjming@deploy1002 Started scap: Backport for [[gerrit:914863|Enable graphs on test wikipedia and mediawiki.org (T334940)]] [20:21:51] T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940 [20:22:20] (03PS3) 10Jdlrobson: Explicitly enable MFCustomSiteModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) [20:23:18] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:914863|Enable graphs on test wikipedia and mediawiki.org (T334940)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:23:20] Jdlrobson: can you test your graphs patch? [20:24:18] (03CR) 10RLazarus: [V: 03+2 C: 03+2] Render SLO and SLI numbers as percentunit (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914032 (owner: 10RLazarus) [20:24:29] cjming: graphs is looking good to sync [20:24:36] fabu - syncing [20:24:45] This will increase client side errors.. 
im just not sure by how much :) [20:30:08] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:914863|Enable graphs on test wikipedia and mediawiki.org (T334940)]] (duration: 08m 19s) [20:30:11] T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940 [20:30:11] Jdlrobson: graphs patch should be live - moving on to your 2nd one [20:30:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) (owner: 10Jdlrobson) [20:31:21] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:43] (03Merged) 10jenkins-bot: Explicitly enable MFCustomSiteModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) (owner: 10Jdlrobson) [20:32:10] !log cjming@deploy1002 Started scap: Backport for [[gerrit:913241|Explicitly enable MFCustomSiteModules (T270603)]] [20:32:13] T270603: Module site.styles generates different output depending on mobile cookie, if $wgMFSiteStylesRenderBlocking = true; - https://phabricator.wikimedia.org/T270603 [20:33:17] this one should be easy to check [20:33:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [20:33:40] !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:913241|Explicitly enable MFCustomSiteModules (T270603)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:33:42] Jdlrobson: wanna check your 2nd patch? 
[20:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T335838)', diff saved to https://phabricator.wikimedia.org/P47480 and previous config saved to /var/cache/conftool/dbconfig/20230503-203424-ladsgroup.json [20:36:44] checking.. [20:37:01] LGTM claime [20:37:04] cjming: [20:37:09] cool - syncing [20:37:45] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:42:34] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:913241|Explicitly enable MFCustomSiteModules (T270603)]] (duration: 10m 23s) [20:42:38] T270603: Module site.styles generates different output depending on mobile cookie, if $wgMFSiteStylesRenderBlocking = true; - https://phabricator.wikimedia.org/T270603 [20:43:15] !log cjming@deploy1002 Started scap: Backport for [[gerrit:914301|Router handling code should be centralized into mmv.bootstrap (T236591)]] [20:43:18] T236591: Exiting an image displayed via mediaviewer on wikipedia takes you back one site in browser history instead of taking you to base article - https://phabricator.wikimedia.org/T236591 [20:43:42] Jdlrobson: 2nd patch should be live - doing your backport now [20:43:42] (httpbb succeeded on a retry, so that error was unrelated to the deploy) [20:43:45] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:46] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:914301|Router handling code should be centralized into mmv.bootstrap (T236591)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:45:05] Jdlrobson: is backport testable? [20:45:11] yep! 
[20:46:12] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:46:15] PROBLEM - Check systemd state on elastic2069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:19] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:05] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:21] cjming: let me know when [20:47:30] Jdlrobson: oh - lmk if i should sync? 
[20:47:37] please test [20:47:49] RECOVERY - Check systemd state on elastic2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:49] yep lgtm [20:47:56] nice - going live [20:48:17] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:50:13] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:53] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:59] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:12] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:23] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:914301|Router handling code should be centralized into mmv.bootstrap (T236591)]] (duration: 10m 08s) [20:53:27] T236591: Exiting an image displayed via mediaviewer on wikipedia takes you back one site in browser history instead of taking you to base article - https://phabricator.wikimedia.org/T236591 [20:53:41] Jdlrobson: all live! [20:53:51] THANKS A BUNCH CLARE! [20:53:56] lol - yw! 
[20:54:11] !log end of UTC late backport window [20:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:51] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:12] (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:33] PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:27] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:12] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:15] !log Upgrading pybal to 1.15.11 on lvs4010 [21:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:21] RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:45] PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:57] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:21] RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:31] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:40] !log milimetric@deploy1002 Started deploy [analytics/refinery@c53c095]: Refinery deploy [analytics/refinery@c53c095] [21:31:03] !log milimetric@deploy1002 Finished deploy [analytics/refinery@c53c095]: Refinery deploy [analytics/refinery@c53c095] (duration: 08m 22s) [21:31:12] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:13] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:36] !log Uploaded pybal_1.15.11 to apt1001 via reprepro [21:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade 
Cassandra — T335383 - eevans@cumin1001 [21:31:51] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [21:32:01] PROBLEM - Check systemd state on elastic2085 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:49] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:11] RECOVERY - Check systemd state on elastic2085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:47] (PS1) BCornwall: Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838 [21:36:12] (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:59] (PS2) BCornwall: Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838 [21:40:17] PROBLEM - Check systemd state on elastic2072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:38] (CR) BBlack: [C: +1] Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838 (owner: BCornwall) [21:41:12] (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:20] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41025/console" [puppet] - https://gerrit.wikimedia.org/r/914838 (owner: BCornwall) [21:41:51] RECOVERY - Check systemd state on elastic2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:42:10] (CR) BCornwall: [V: +1 C: +2] Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838 (owner: BCornwall) [21:43:23] !log milimetric@deploy1002 Started deploy [analytics/refinery@c53c095] (thin): Deploy THIN [analytics/refinery@c53c095] [21:43:29] !log milimetric@deploy1002 Finished deploy [analytics/refinery@c53c095] (thin): Deploy THIN [analytics/refinery@c53c095] (duration: 00m 06s) [21:46:12] (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:05] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:05] (PS1) Eevans: aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) [21:49:18] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans) [21:52:45] (PS2) Eevans: aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) [21:53:30] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans) [21:55:21] !log Disable puppet on lvs4008 for new pybal deployment (just in case immediate config rollback is required) - T263797 [21:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:25] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [21:55:59] RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:59] (CR) Eevans: [C: +2] aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans) [21:57:15] PROBLEM - Check systemd state on elastic2064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:51] RECOVERY - Check systemd state on elastic2064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:26] !log eevans@cumin1001 START - Cookbook 
sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [22:00:30] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [22:06:12] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [22:08:09] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [22:08:27] (CR) Btullis: [C: +1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans) [22:10:02] ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder) [22:11:12] (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:28] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [22:15:33] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:12] (SystemdUnitFailed) firing: (17) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:09] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:21] (PS2) Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881 [22:19:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [22:19:44] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [22:19:46] (CR) CI reject: [V: -1] trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881 (owner: Dzahn) [22:21:12] (SystemdUnitFailed) resolved: (11) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:26] (CR) Dzahn: gerrit: move hieradata from role/common to common/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911920 (owner: Dzahn) [22:24:25] PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:28] (PS3) Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881 [22:26:12] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:20] (PS2) Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - https://gerrit.wikimedia.org/r/911920 [22:27:22] (PS1) Dzahn: gerrit: move all gerrit::profile hiera keys to common/profile [puppet] - https://gerrit.wikimedia.org/r/914901 [22:28:05] (CR) Dzahn: gerrit: move hieradata from role/common to common/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911920 (owner: Dzahn) [22:28:19] (Abandoned) Dzahn: gerrit: move all gerrit::profile hiera keys to common/profile [puppet] - https://gerrit.wikimedia.org/r/914901 (owner: Dzahn) [22:30:53] (PS3) Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - https://gerrit.wikimedia.org/r/911920 [22:31:12] (SystemdUnitFailed) resolved: (9) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:00] (PS1) Zabe: Start writing to af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/914903 (https://phabricator.wikimedia.org/T334295) [22:33:15] jouncebot: nowandnext [22:33:15] No deployments scheduled for the next 7 hour(s) and 26 minute(s) [22:33:15] In 7 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600) [22:33:15] In 7 hour(s) and 26 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600) [22:33:19] PROBLEM - Check systemd state on elastic2080 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:49] (CR) Dzahn: gerrit: move hieradata from role/common to common/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911920 (owner: Dzahn) [22:33:55] (CR) Zabe: [C: +2] Start writing to af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/914903 (https://phabricator.wikimedia.org/T334295) (owner: Zabe) [22:34:39] !log removing 12 files for legal compliance [22:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:40] (Merged) jenkins-bot: Start writing to af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/914903 (https://phabricator.wikimedia.org/T334295) (owner: Zabe) [22:35:30] !log zabe@deploy1002 Started scap: Backport for [[gerrit:914903|Start writing to af_actor/afh_actor in group1 wikis (T334295)]] [22:35:30] (PS4) Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - https://gerrit.wikimedia.org/r/911920 [22:35:32] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295 [22:35:33] RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:06] !log zabe@deploy1002 zabe: Backport for [[gerrit:914903|Start writing to af_actor/afh_actor in group1 wikis (T334295)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:41:07] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:11] RECOVERY - Check systemd state on elastic2080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:12] (SystemdUnitFailed) firing: (14) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:41:21] PROBLEM - Check systemd state on elastic2083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:21] PROBLEM - Check systemd state on elastic2081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:41] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:43] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:914903|Start writing to af_actor/afh_actor in group1 wikis (T334295)]] (duration: 07m 13s) [22:42:47] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295 [22:42:55] RECOVERY - Check systemd state on elastic2083 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:55] RECOVERY - Check systemd state on elastic2081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:12] (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:50:03] PROBLEM - Check systemd state on elastic2086 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:12] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:51:39] RECOVERY - Check systemd state on elastic2086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:12] (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:57:19] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:13] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:43] PROBLEM - Check systemd state on elastic2073 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:05] PROBLEM - Check systemd state on elastic2061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:17] RECOVERY - Check systemd state on elastic2073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:39] RECOVERY - Check systemd state on elastic2061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:02] !log removing 1 file for legal compliance [23:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:47] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T335835 [23:16:12] (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2054:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:12] (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:34:30] (PS1) EoghanGaffney: [spicerack/decorators] Don't miss dry_run if it's disabled in kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855) [23:35:11] (PS2) EoghanGaffney: [spicerack/decorators] Don't miss dry_run if it's disabled in kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855) [23:47:22] (PS1) Xcollazo: Add configs to spark-defaults.conf to enable Iceberg. [puppet] - https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721)