[00:20:45] <icinga-wm>	 RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:28:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[00:39:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913943
[00:39:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913943 (owner: TrainBranchBot)
[00:47:08] <sukhe>	 !log restart haproxy on cp2031: T334448
[00:47:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:11] <stashbot>	 T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448
[00:56:51] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/913943 (owner: TrainBranchBot)
[01:12:42] <wikibugs>	 (PS2) Andrew Bogott: wmcs-webproxy: use os-cloud section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914416 (https://phabricator.wikimedia.org/T330759)
[01:12:44] <wikibugs>	 (PS2) Andrew Bogott: wmcs-wikireplica-dns: use os-cloud section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914417 (https://phabricator.wikimedia.org/T330759)
[01:12:47] <wikibugs>	 (PS2) Andrew Bogott: wmcs-enc-cli:  use os-cloud section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914418 (https://phabricator.wikimedia.org/T330759)
[01:12:48] <wikibugs>	 (PS3) Andrew Bogott: wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759)
[01:55:17] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder)
[02:07:54] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:30] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:27] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:42:09] <wikibugs>	 (PS10) Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: Fomafix)
[02:45:02] <wikibugs>	 (PS1) Andrew Bogott: wmcs-spreadcheck: use clouds.yaml section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914463 (https://phabricator.wikimedia.org/T330759)
[02:45:04] <wikibugs>	 (PS1) Andrew Bogott: nfs-exportd: convert to using mwopenstackclients and --os-cloud [puppet] - https://gerrit.wikimedia.org/r/914464 (https://phabricator.wikimedia.org/T330759)
[03:38:30] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[04:48:54] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-05-03-044244-production [deployment-charts] - https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835)
[05:32:37] <wikibugs>	 (CR) Elukey: [C: +2] fastapi-app: add networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/914342 (owner: Giuseppe Lavagetto)
[05:32:52] <wikibugs>	 (PS5) Elukey: ml-services: add network policies for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414)
[05:39:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Disconnecting codfw > eqiad  T335267
[05:39:21] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:39:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Disconnecting codfw > eqiad  T335267
[05:40:33] <marostegui>	 !log Disconnect codfw -> eqiad replication on pc1 T335267
[05:40:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:36] <marostegui>	 !log Disconnect codfw -> eqiad replication on pc2 T335267
[05:40:38] <marostegui>	 !log Disconnect codfw -> eqiad replication on pc3 T335267
[05:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:44] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add network policies for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/914319 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[05:40:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Disconnecting codfw > eqiad  T335267
[05:41:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Disconnecting codfw > eqiad  T335267
[05:41:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Disconnecting codfw > eqiad  T335267
[05:41:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Disconnecting codfw > eqiad  T335267
[05:44:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 10 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:44:10] <marostegui>	 !log Disconnect codfw -> eqiad replication on x1 T335267
[05:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 10 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:44:28] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:47:08] <wikibugs>	 (PS1) Samwilson: Remove duplicated diff-mode selector in save dialog [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914429 (https://phabricator.wikimedia.org/T324759)
[05:48:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:48:14] <marostegui>	 !log Disconnect codfw -> eqiad replication on es4 T335267
[05:48:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:51:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:51:27] <marostegui>	 !log Disconnect codfw -> eqiad replication on es5 T335267
[05:51:28] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:54:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:54:10] <marostegui>	 !log Disconnect codfw -> eqiad replication on s6 T335267
[05:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:57:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:57:46] <marostegui>	 !log Disconnect codfw -> eqiad replication on s2 T335267
[05:57:47] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[05:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 27 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:59:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 26 hosts with reason: Disconnecting codfw > eqiad  T335267
[05:59:22] <marostegui>	 !log Disconnect codfw -> eqiad replication on s5 T335267
[05:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 26 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:00:02] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T0600)
[06:01:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 24 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:01:58] <marostegui>	 !log Disconnect codfw -> eqiad replication on s3 T335267
[06:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 24 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:06:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 28 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:06:29] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[06:06:36] <marostegui>	 !log Disconnect codfw -> eqiad replication on s7 T335267
[06:06:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 28 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:09:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 35 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:09:37] <marostegui>	 !log Disconnect codfw -> eqiad replication on s4 T335267
[06:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 35 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:14:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 34 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:14:09] <marostegui>	 !log Disconnect codfw -> eqiad replication on s8 T335267
[06:14:09] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[06:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 34 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:19:23] <wikibugs>	 (PS2) EoghanGaffney: [gitlab/runner] Add basic pool/depool commands [puppet] - https://gerrit.wikimedia.org/r/913199
[06:20:10] <wikibugs>	 (PS1) Jelto: aptrepo: update gitlab-ce and gitlab-runner to 15.9 [puppet] - https://gerrit.wikimedia.org/r/914594 (https://phabricator.wikimedia.org/T335784)
[06:23:30] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:25] <wikibugs>	 (CR) Ayounsi: [C: +2] Apply black to all python files [software/netbox-extras] - https://gerrit.wikimedia.org/r/907880 (owner: Ayounsi)
[06:26:23] <wikibugs>	 (Merged) jenkins-bot: Apply black to all python files [software/netbox-extras] - https://gerrit.wikimedia.org/r/907880 (owner: Ayounsi)
[06:26:25] <wikibugs>	 (CR) Ayounsi: [C: +1] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors [homer/public] - https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) (owner: Ssingh)
[06:28:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[06:29:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[06:41:13] <wikibugs>	 (PS1) Ayounsi: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597
[06:42:05] <wikibugs>	 (CR) CI reject: [V: -1] Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi)
[06:43:33] <wikibugs>	 (PS2) Ayounsi: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597
[06:45:27] <wikibugs>	 (PS5) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[06:46:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 38 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:46:03] <marostegui>	 !log Disconnect codfw -> eqiad replication on s1 T335267
[06:46:03] <stashbot>	 T335267: Disconnect replication codfw -> eqiad - https://phabricator.wikimedia.org/T335267
[06:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 38 hosts with reason: Disconnecting codfw > eqiad  T335267
[06:47:00] <wikibugs>	 (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[06:48:23] <wikibugs>	 SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (Marostegui) @Papaul what else do they need? We have pasted their idrac's log
[06:48:27] <wikibugs>	 (CR) Ayounsi: "Messages Found: 298 with Ib6aaba35a1aa34ac1680110a6fc265bf9b72bfb9" [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[06:50:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1117.eqiad.wmnet
[06:51:16] <wikibugs>	 (PS1) Marostegui: site.pp: Decommission db1117 [puppet] - https://gerrit.wikimedia.org/r/914696 (https://phabricator.wikimedia.org/T335017)
[06:53:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add python-all to make pybal buildable on build2001 [puppet] - https://gerrit.wikimedia.org/r/914349 (owner: Muehlenhoff)
[06:53:18] <wikibugs>	 (PS2) Muehlenhoff: Add python-all to make pybal buildable on build2001 [puppet] - https://gerrit.wikimedia.org/r/914349
[06:55:42] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Decommission db1117 [puppet] - https://gerrit.wikimedia.org/r/914696 (https://phabricator.wikimedia.org/T335017) (owner: Marostegui)
[06:55:53] <wikibugs>	 (CR) JMeybohm: [C: -1] "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto)
[06:56:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:56:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/914594 (https://phabricator.wikimedia.org/T335784) (owner: Jelto)
[06:57:01] <wikibugs>	 (CR) Jelto: [C: +2] aptrepo: update gitlab-ce and gitlab-runner to 15.9 [puppet] - https://gerrit.wikimedia.org/r/914594 (https://phabricator.wikimedia.org/T335784) (owner: Jelto)
[06:58:01] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (Marostegui)
[06:58:51] <wikibugs>	 (PS1) Marostegui: wmnet: Replace db1117 with db1217 [dns] - https://gerrit.wikimedia.org/r/914697 (https://phabricator.wikimedia.org/T335017)
[06:59:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[06:59:49] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Replace db1117 with db1217 [dns] - https://gerrit.wikimedia.org/r/914697 (https://phabricator.wikimedia.org/T335017) (owner: Marostegui)
[07:00:01] <wikibugs>	 (CR) Filippo Giunchedi: "Need to remove ./hieradata/hosts/prometheus3001.yaml too" [puppet] - https://gerrit.wikimedia.org/r/913249 (https://phabricator.wikimedia.org/T33558) (owner: Andrea Denisse)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T0700)
[07:00:05] <jouncebot>	 samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:52] <wikibugs>	 (CR) Filippo Giunchedi: "Need to remove ./hieradata/hosts/prometheus4001.yaml too" [puppet] - https://gerrit.wikimedia.org/r/913250 (https://phabricator.wikimedia.org/T335585) (owner: Andrea Denisse)
[07:01:03] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1214 [puppet] - https://gerrit.wikimedia.org/r/914698
[07:01:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1117.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:01:28] <samwilson>	 Amir1 urbanecm taavi hullo, I'm present; is one of you deploying today?
[07:01:36] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1214 [puppet] - https://gerrit.wikimedia.org/r/914698 (owner: Marostegui)
[07:01:40] <taavi>	 yep, give me a second
[07:02:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914429 (https://phabricator.wikimedia.org/T324759) (owner: Samwilson)
[07:02:20] <samwilson>	 no hurry :) 
[07:02:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1117.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:02:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:02:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1117.eqiad.wmnet
[07:02:39] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1117.eqiad.wmnet` - db1117.eqiad.wmnet (**WARN**)   - Downtimed host on...
[07:05:10] <wikibugs>	 ops-codfw, DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (Marostegui) @Jhancock.wm  @Papaul let me know what do you need to make this happen? Do you need to turn the host off completely or just the idrac?
[07:07:39] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1213 (s5,s6) to dbctl [puppet] - https://gerrit.wikimedia.org/r/914699 (https://phabricator.wikimedia.org/T326669)
[07:08:52] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1213 (s5,s6) to dbctl [puppet] - https://gerrit.wikimedia.org/r/914699 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:09:29] <moritzm>	 !log installing glibc bugfix updates from bullseye point release
[07:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1213 (s5,s6) to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P47297 and previous config saved to /var/cache/conftool/dbconfig/20230503-071046-marostegui.json
[07:10:49] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:11:44] <wikibugs>	 (PS1) Marostegui: db1213: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/914700 (https://phabricator.wikimedia.org/T326669)
[07:12:12] <wikibugs>	 (CR) Marostegui: [C: +2] db1213: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/914700 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:13:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 1%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47298 and previous config saved to /var/cache/conftool/dbconfig/20230503-071303-root.json
[07:13:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 1%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47299 and previous config saved to /var/cache/conftool/dbconfig/20230503-071313-root.json
[07:15:03] <wikibugs>	 (PS1) Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701
[07:17:48] <wikibugs>	 (Merged) jenkins-bot: Remove duplicated diff-mode selector in save dialog [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914429 (https://phabricator.wikimedia.org/T324759) (owner: Samwilson)
[07:17:59] <taavi>	 finally
[07:18:40] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:914429|Remove duplicated diff-mode selector in save dialog (T324759)]]
[07:18:43] <stashbot>	 T324759: Inline Diff: Add legend and tooltips - https://phabricator.wikimedia.org/T324759
[07:19:24] <icinga-wm>	 PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:58] <wikibugs>	 (PS2) Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701
[07:20:13] <logmsgbot>	 !log taavi@deploy1002 taavi and samwilson: Backport for [[gerrit:914429|Remove duplicated diff-mode selector in save dialog (T324759)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:20:19] <taavi>	 samwilson: please test
[07:20:41] <samwilson>	 thanks. testing now.
[07:21:43] <wikibugs>	 (PS6) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[07:22:39] <wikibugs>	 (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[07:22:47] <samwilson>	 taavi: all looks good
[07:22:54] <taavi>	 ok, syncing
[07:22:56] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[07:23:30] <wikibugs>	 SRE, DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (Marostegui) I have emailed Monty about this, as it affected 11.1 too - we'll see what he says. My guess is that this is not something specific for 10.6 or 11.0 as the 10.6 hosts in codfw didn't have t...
[07:26:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[07:28:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 2%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47302 and previous config saved to /var/cache/conftool/dbconfig/20230503-072808-root.json
[07:28:12] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:28:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 2%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47303 and previous config saved to /var/cache/conftool/dbconfig/20230503-072818-root.json
[07:28:55] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:914429|Remove duplicated diff-mode selector in save dialog (T324759)]] (duration: 10m 14s)
[07:28:58] <stashbot>	 T324759: Inline Diff: Add legend and tooltips - https://phabricator.wikimedia.org/T324759
[07:29:21] <taavi>	 deployed!
[07:29:41] <samwilson>	 thanks! :) 
[07:36:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 T335011', diff saved to https://phabricator.wikimedia.org/P47304 and previous config saved to /var/cache/conftool/dbconfig/20230503-073602-root.json
[07:36:07] <stashbot>	 T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011
[07:36:33] <wikibugs>	 (PS1) Marostegui: db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/914702 (https://phabricator.wikimedia.org/T335011)
[07:37:42] <wikibugs>	 (CR) Marostegui: [C: +2] db1110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/914702 (https://phabricator.wikimedia.org/T335011) (owner: Marostegui)
[07:38:30] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 3%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47305 and previous config saved to /var/cache/conftool/dbconfig/20230503-074313-root.json
[07:43:17] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:43:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 3%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47306 and previous config saved to /var/cache/conftool/dbconfig/20230503-074323-root.json
[07:44:00] <wikibugs>	 (CR) Superpes15: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[07:44:38] <wikibugs>	 (CR) CI reject: [V: -1] Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[07:44:50] <wikibugs>	 (PS1) Marostegui: db1118,db1110: Update notes [puppet] - https://gerrit.wikimedia.org/r/914703 (https://phabricator.wikimedia.org/T335011)
[07:45:28] <wikibugs>	 (CR) Marostegui: [C: +2] db1118,db1110: Update notes [puppet] - https://gerrit.wikimedia.org/r/914703 (https://phabricator.wikimedia.org/T335011) (owner: Marostegui)
[07:46:25] <wikibugs>	 (CR) Superpes15: "seems you used spaces instead of tab! please fix it" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[07:48:22] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade
[07:52:23] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi)
[07:56:14] <wikibugs>	 (CR) Volans: [C: -1] "The new envs can't be run and the style one is not run by CI" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[07:56:19] <wikibugs>	 (PS2) Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371
[07:56:33] <wikibugs>	 (CR) CI reject: [V: -1] Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371 (owner: Giuseppe Lavagetto)
[07:57:03] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[07:58:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 4%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47307 and previous config saved to /var/cache/conftool/dbconfig/20230503-075818-root.json
[07:58:21] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:58:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 4%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47308 and previous config saved to /var/cache/conftool/dbconfig/20230503-075828-root.json
[08:01:18] <wikibugs>	 (PS3) Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - https://gerrit.wikimedia.org/r/914371
[08:01:28] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Rake: run with heuristics by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/914371 (owner: 10Giuseppe Lavagetto)
[08:02:23] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Rake: run with heuristics by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/914371
[08:04:02] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[08:12:21] <urbanecm>	 jouncebot: nowandnext
[08:12:21] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 47 minute(s)
[08:12:21] <jouncebot>	 In 1 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000)
[08:12:51] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Let mentors to skip suggestions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914426 (https://phabricator.wikimedia.org/T334300) (owner: 10Urbanecm)
[08:13:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 5%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47309 and previous config saved to /var/cache/conftool/dbconfig/20230503-081323-root.json
[08:13:27] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:13:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 5%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47310 and previous config saved to /var/cache/conftool/dbconfig/20230503-081332-root.json
[08:14:25] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Clement_Goubert)
[08:15:27] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Clement_Goubert) New internal certs now include `wikifunctions.org` an...
[08:15:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914426 (https://phabricator.wikimedia.org/T334300) (owner: 10Urbanecm)
[08:16:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Rake: run with heuristics by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/914371 (owner: 10Giuseppe Lavagetto)
[08:28:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 10%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47311 and previous config saved to /var/cache/conftool/dbconfig/20230503-082827-root.json
[08:28:30] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[08:28:32] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:28:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 10%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47312 and previous config saved to /var/cache/conftool/dbconfig/20230503-082837-root.json
[08:32:08] <wikibugs>	 (03Merged) 10jenkins-bot: Personalized praise: Let mentors to skip suggestions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914426 (https://phabricator.wikimedia.org/T334300) (owner: 10Urbanecm)
[08:32:37] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]]
[08:32:40] <stashbot>	 T334300: Personalized praise: design for skipping praiseworthy mentees - https://phabricator.wikimedia.org/T334300
[08:38:14] <wikibugs>	 (03PS5) 10Urbanecm: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630)
[08:38:34] <wikibugs>	 (03PS4) 10Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630)
[08:38:41] <wikibugs>	 (03PS4) 10Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630)
[08:39:29] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade
[08:39:42] <marostegui>	 !log dbmaint deploy schema change on eqiad s3 with replication T335834
[08:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:45] <stashbot>	 T335834: Update cx_section_translations table - https://phabricator.wikimedia.org/T335834
[08:41:18] <wikibugs>	 (03PS1) 10Hashar: ci: daily update all git cache repositories [puppet] - 10https://gerrit.wikimedia.org/r/914710
[08:41:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ci: daily update all git cache repositories [puppet] - 10https://gerrit.wikimedia.org/r/914710 (owner: 10Hashar)
[08:43:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 25%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47313 and previous config saved to /var/cache/conftool/dbconfig/20230503-084332-root.json
[08:43:36] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:43:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 25%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47314 and previous config saved to /var/cache/conftool/dbconfig/20230503-084342-root.json
[08:44:57] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade
[08:47:35] <wikibugs>	 (03PS2) 10Hashar: ci: daily update all git cache repositories [puppet] - 10https://gerrit.wikimedia.org/r/914710
[08:48:38] <wikibugs>	 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10ayounsi) >  The obvious solution is to allow passing of the specific IP to use, and default to $facts['ipaddre...
[08:58:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 50%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47315 and previous config saved to /var/cache/conftool/dbconfig/20230503-085837-root.json
[08:58:41] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:58:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 50%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47316 and previous config saved to /var/cache/conftool/dbconfig/20230503-085847-root.json
[08:59:45] <wikibugs>	 (03PS1) 10Hashar: ci: add a couple extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/914711
[09:00:17] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]] (duration: 27m 39s)
[09:00:22] <stashbot>	 T334300: Personalized praise: design for skipping praiseworthy mentees - https://phabricator.wikimedia.org/T334300
[09:01:22] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet
[09:01:35] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host acmechief1001.eqiad.wmnet
[09:02:11] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet
[09:02:56] <wikibugs>	 (03CR) 10Subramanya Sastry: "I am going to try to get this deployed in a backport window today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: 10C. Scott Ananian)
[09:05:43] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet
[09:05:55] <kart_>	 marostegui: Re: T335834, we are testing on testwiki. Need some time for that.
[09:05:56] <stashbot>	 T335834: Update cx_section_translations table - https://phabricator.wikimedia.org/T335834
[09:06:15] <marostegui>	 kart_: No rush, I probably won't get to wikishared until next week anyway
[09:06:38] <kart_>	 marostegui: Thanks. 
[09:08:33] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet
[09:11:42] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]]
[09:11:44] <stashbot>	 T334300: Personalized praise: design for skipping praiseworthy mentees - https://phabricator.wikimedia.org/T334300
[09:11:48] <logmsgbot>	 !log urbanecm@deploy1002 sync-world aborted: Backport for [[gerrit:914426|Personalized praise: Let mentors to skip suggestions (T334300)]] (duration: 00m 06s)
[09:12:20] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet
[09:12:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[09:12:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rake: run with heuristics by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/914371 (owner: 10Giuseppe Lavagetto)
[09:13:20] <wikibugs>	 (03Merged) 10jenkins-bot: Rake: run with heuristics by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/914371 (owner: 10Giuseppe Lavagetto)
[09:13:29] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Add GEMentorDashboardEnabledModules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914373 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[09:13:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 75%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47317 and previous config saved to /var/cache/conftool/dbconfig/20230503-091342-root.json
[09:13:46] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[09:13:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 75%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47318 and previous config saved to /var/cache/conftool/dbconfig/20230503-091352-root.json
[09:13:58] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914373|[Growth] Add GEMentorDashboardEnabledModules (T334630)]]
[09:14:01] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[09:14:18] <wikibugs>	 (03PS1) 10Urbanecm: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914435 (https://phabricator.wikimedia.org/T322443)
[09:14:29] <wikibugs>	 (03PS1) 10Urbanecm: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914436 (https://phabricator.wikimedia.org/T322443)
[09:15:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:17:03] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[09:17:20] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org
[09:19:24] <wikibugs>	 (03PS1) 10Volans: Release v0.6.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/914716
[09:20:25] <icinga-wm>	 RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:20:47] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[09:20:55] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914373|[Growth] Add GEMentorDashboardEnabledModules (T334630)]] (duration: 06m 56s)
[09:20:58] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[09:21:16] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet
[09:21:44] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Release v0.6.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/914716 (owner: 10Volans)
[09:21:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914436 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm)
[09:21:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914435 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm)
[09:22:49] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: 10Clément Goubert)
[09:23:30] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:21] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Release v0.6.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/914716 (owner: 10Volans)
[09:24:21] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org
[09:24:26] <wikibugs>	 (03PS1) 10Marostegui: db2124: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/914717 (https://phabricator.wikimedia.org/T334650)
[09:24:45] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[09:24:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2124.codfw.wmnet with reason: Migrating to 10.6 and rebooting
[09:25:00] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet
[09:25:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2124.codfw.wmnet with reason: Migrating to 10.6 and rebooting
[09:25:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2124', diff saved to https://phabricator.wikimedia.org/P47319 and previous config saved to /var/cache/conftool/dbconfig/20230503-092513-root.json
[09:26:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[09:27:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2124: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/914717 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui)
[09:28:19] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.2 - volans@cumin1001
[09:28:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 100%: Pooling db1213:3315 T326669', diff saved to https://phabricator.wikimedia.org/P47320 and previous config saved to /var/cache/conftool/dbconfig/20230503-092847-root.json
[09:28:50] <stashbot>	 T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[09:28:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 100%: Pooling db1213:3316 T326669', diff saved to https://phabricator.wikimedia.org/P47321 and previous config saved to /var/cache/conftool/dbconfig/20230503-092856-root.json
[09:29:37] <wikibugs>	 (03PS1) 10Marostegui: db2124: Enable notications [puppet] - 10https://gerrit.wikimedia.org/r/914718
[09:29:45] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org
[09:29:57] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.2 - volans@cumin1001
[09:30:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netbox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10jbond)
[09:31:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Revert "sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage" [cookbooks] - 10https://gerrit.wikimedia.org/r/912311 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[09:33:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2124: Enable notications [puppet] - 10https://gerrit.wikimedia.org/r/914718 (owner: 10Marostegui)
[09:34:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/910461 (https://phabricator.wikimedia.org/T334880) (owner: 10Volans)
[09:35:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47322 and previous config saved to /var/cache/conftool/dbconfig/20230503-093503-root.json
[09:35:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[09:36:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[09:36:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T335838)', diff saved to https://phabricator.wikimedia.org/P47323 and previous config saved to /var/cache/conftool/dbconfig/20230503-093606-ladsgroup.json
[09:36:21] <Lucas_WMDE>	 jouncebot: nowandnext
[09:36:21] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[09:36:21] <jouncebot>	 In 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000)
[09:36:35] <Lucas_WMDE>	 I’d like to deploy some backports ahead of the UTC afternoon window, if that’s okay
[09:36:55] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade
[09:38:08] <claime>	 No objection from me
[09:38:24] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460)
[09:38:32] <claime>	 I'm migrating recommendation-api to mw-api-int in half an hour, but that shouldn't impact you I'd think
[09:41:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914297 (https://phabricator.wikimedia.org/T300460) (owner: 10Michael Große)
[09:41:21] <Lucas_WMDE>	 starting with this one then ^
[09:41:33] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: Support configuration as env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/914721 (https://phabricator.wikimedia.org/T331505)
[09:41:34] <Lucas_WMDE>	 (gate-and-submit will take some time in case anyone wants to stop me ^^)
[09:41:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T335838)', diff saved to https://phabricator.wikimedia.org/P47324 and previous config saved to /var/cache/conftool/dbconfig/20230503-094135-ladsgroup.json
[09:41:35] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: Add people to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/914722 (https://phabricator.wikimedia.org/T331505)
[09:42:03] <Lucas_WMDE>	 huh, there are two GrowthExperiments changes in gate-and-submit-wmf?
[09:42:09] <wikibugs>	 (03Merged) 10jenkins-bot: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914436 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm)
[09:42:11] <wikibugs>	 (03Merged) 10jenkins-bot: Personalized praise: Run convertNumber() before displaying numbers [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914435 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm)
[09:42:22] <Lucas_WMDE>	 urbanecm: can you ping me when you’re done?
[09:42:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (10SLyngshede-WMF) a:03SLyngshede-WMF
[09:42:40] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914436|Personalized praise: Run convertNumber() before displaying numbers (T322443)]], [[gerrit:914435|Personalized praise: Run convertNumber() before displaying numbers (T322443)]]
[09:42:43] <stashbot>	 T322443: Personalized praise: new mentor dashboard module - https://phabricator.wikimedia.org/T322443
[09:44:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 4:00:00 on db1110.eqiad.wmnet with reason: Moving to m3 T335092
[09:44:16] <stashbot>	 T335092: Move db1110 to m3 - https://phabricator.wikimedia.org/T335092
[09:44:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 4:00:00 on db1110.eqiad.wmnet with reason: Moving to m3 T335092
[09:46:33] <wikibugs>	 (03PS1) 10Marostegui: db1110: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/914723
[09:47:13] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw
[09:47:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Support configuration as env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/914721 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[09:47:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1110: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/914723 (owner: 10Marostegui)
[09:47:21] <wikibugs>	 (03PS13) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[09:47:39] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[09:47:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:48:04] <wikibugs>	 (03Merged) 10jenkins-bot: machinetranslation: Support configuration as env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/914721 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[09:49:34] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914436|Personalized praise: Run convertNumber() before displaying numbers (T322443)]], [[gerrit:914435|Personalized praise: Run convertNumber() before displaying numbers (T322443)]] (duration: 06m 53s)
[09:49:37] <stashbot>	 T322443: Personalized praise: new mentor dashboard module - https://phabricator.wikimedia.org/T322443
[09:50:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 3%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47325 and previous config saved to /var/cache/conftool/dbconfig/20230503-095008-root.json
[09:50:24] <urbanecm>	 Lucas_WMDE: sorry, missed your message. I'm done now.
[09:50:30] <Lucas_WMDE>	 np, thanks!
[09:50:37] <Lucas_WMDE>	 I’m still waiting for gate-and-submit
[09:50:40] <urbanecm>	 ack
[09:51:27] <icinga-wm>	 PROBLEM - Host gitlab2003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:52:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764)
[09:52:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[09:53:00] <jelto>	 ^ working on gitlab2003
[09:53:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Cloning db1110 from db1217:3323 T335092
[09:53:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Add people to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/914722 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[09:53:40] <stashbot>	 T335092: Move db1110 to m3 - https://phabricator.wikimedia.org/T335092
[09:53:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Cloning db1110 from db1217:3323 T335092
[09:54:21] <wikibugs>	 (03Merged) 10jenkins-bot: machinetranslation: Add people to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/914722 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[09:54:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41004/console" [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[09:55:10] <wikibugs>	 (03PS2) 10Filippo Giunchedi: ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764)
[09:55:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance
[09:55:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance
[09:56:08] <wikibugs>	 (03CR) 10Ayounsi: Replace most .format() to f-string (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914597 (owner: 10Ayounsi)
[09:56:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P47327 and previous config saved to /var/cache/conftool/dbconfig/20230503-095641-ladsgroup.json
[09:57:07] <icinga-wm>	 RECOVERY - Host gitlab2003 is UP: PING OK - Packet loss = 0%, RTA = 34.59 ms
[09:57:33] <jynus>	 proxies, expected v
[09:57:36] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Move db1110 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/914727 (https://phabricator.wikimedia.org/T335092)
[09:57:37] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:58:10] <wikibugs>	 (03PS1) 10Elukey: conftool-data: add config for the k8s ingress for ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756)
[09:58:33] <icinga-wm>	 PROBLEM - SSH on gitlab2003 is CRITICAL: connect to address 208.80.153.52 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:58:41] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:58:42] <wikibugs>	 (03CR) 10Ayounsi: Add style checker and auto-formater to tox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914701 (owner: 10Ayounsi)
[09:58:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance
[09:58:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance
[09:59:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T335838)', diff saved to https://phabricator.wikimedia.org/P47328 and previous config saved to /var/cache/conftool/dbconfig/20230503-095901-ladsgroup.json
[10:00:01] <wikibugs>	 (03Merged) 10jenkins-bot: wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914297 (https://phabricator.wikimedia.org/T300460) (owner: 10Michael Große)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000)
[10:00:08] <wikibugs>	 (03PS3) 10Ayounsi: Replace most .format() to f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914597
[10:00:10] <wikibugs>	 (03PS3) 10Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914701
[10:00:12] <wikibugs>	 (03PS7) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[10:00:17] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder)
[10:00:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914297|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]]
[10:00:37] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet
[10:00:42] <stashbot>	 T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962
[10:00:43] <stashbot>	 T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460
[10:00:59] <Lucas_WMDE>	 ok, it merged
[10:01:12] <Lucas_WMDE>	 I was hoping this would finish before the deploy window :/
[10:01:16] <claime>	 No worries
[10:01:20] <wikibugs>	 (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[10:01:41] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:02:41] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet
[10:02:59] <wikibugs>	 (CR) Ayounsi: [C: +2] Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi)
[10:03:18] <wikibugs>	 (CR) Volans: "reply inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[10:03:30] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet
[10:03:56] <wikibugs>	 (Merged) jenkins-bot: Replace most .format() to f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/914597 (owner: Ayounsi)
[10:04:17] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:04:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T335838)', diff saved to https://phabricator.wikimedia.org/P47329 and previous config saved to /var/cache/conftool/dbconfig/20230503-100420-ladsgroup.json
[10:04:21] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:04:25] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:04:41] <icinga-wm>	 PROBLEM - Host gitlab2003 is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47330 and previous config saved to /var/cache/conftool/dbconfig/20230503-100513-root.json
[10:05:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:05:24] <marostegui>	 haproxy alerts are expected
[10:06:35] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm couple of minor nits" [puppet] - https://gerrit.wikimedia.org/r/914308 (owner: Muehlenhoff)
[10:06:41] <Lucas_WMDE>	 build-and-push-container-images is taking a while for me
[10:07:24] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet
[10:07:27] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet
[10:07:33] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:07:37] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:07:42] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2004.codfw.wmnet
[10:07:43] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:08:11] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:08:21] <wikibugs>	 (PS1) Alexandros Kosiaris: machinetranslation: Fix egress dst_nets indentation [deployment-charts] - https://gerrit.wikimedia.org/r/914729
[10:09:01] <wikibugs>	 (PS14) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[10:09:13] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet
[10:09:36] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad
[10:09:57] <wikibugs>	 (PS4) Ayounsi: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701
[10:09:59] <wikibugs>	 (PS8) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[10:10:03] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] machinetranslation: Fix egress dst_nets indentation [deployment-charts] - https://gerrit.wikimedia.org/r/914729 (owner: Alexandros Kosiaris)
[10:10:18] <wikibugs>	 (CR) Ayounsi: Add style checker and auto-formater to tox (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[10:10:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:10:21] <icinga-wm>	 RECOVERY - Host gitlab2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[10:10:39] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet
[10:10:42] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet
[10:10:46] <wikibugs>	 (Merged) jenkins-bot: machinetranslation: Fix egress dst_nets indentation [deployment-charts] - https://gerrit.wikimedia.org/r/914729 (owner: Alexandros Kosiaris)
[10:10:51] <wikibugs>	 (CR) Jbond: Jupyterhub-conda exclude /mnt from accessible paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene)
[10:11:05] <wikibugs>	 (CR) CI reject: [V: -1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[10:11:32] <wikibugs>	 (CR) Volans: [C: +1] "I've not tested it but LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[10:11:42] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: -1] InitialiseSettings.php: Change termbox url for testwikidatawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[10:11:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P47331 and previous config saved to /var/cache/conftool/dbconfig/20230503-101147-ladsgroup.json
[10:12:30] <wikibugs>	 (PS1) Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414)
[10:12:53] <Lucas_WMDE>	 ah, my build-and-push-container-images finally finished
[10:12:57] <Lucas_WMDE>	 (10m58s o_O)
[10:13:21] <wikibugs>	 (CR) CI reject: [V: -1] Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[10:13:33] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/914404 (https://phabricator.wikimedia.org/T268344) (owner: JHathaway)
[10:13:46] <wikibugs>	 (CR) Clément Goubert: InitialiseSettings.php: Change termbox url for testwikidatawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[10:13:49] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[10:14:06] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: -1] InitialiseSettings.php: Change termbox url for testwikidatawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[10:14:14] <godog>	 jouncebot: now and next
[10:14:14] <jouncebot>	 For the next 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1000)
[10:14:41] <Lucas_WMDE>	 I’m currently in the middle of a scap backport fyi
[10:14:50] <godog>	 thank you Lucas_WMDE ! yeah was about to ask
[10:15:01] <godog>	 I'll hold on to the graphite reboot, it can wait
[10:15:06] <Lucas_WMDE>	 not planning to backport any further changes after that though
[10:15:09] <Lucas_WMDE>	 I’ll ping you when I’m done
[10:15:25] <godog>	 cheers
[10:16:43] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on aphlict1001.eqiad.wmnet with reason: aphlict1002 is now active
[10:16:57] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on aphlict1001.eqiad.wmnet with reason: aphlict1002 is now active
[10:17:16] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet
[10:17:35] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm.  fyi i did start converting theses to puppet functions[1] but need to refresh the work.  perhaps doing them in puppet was to optimis" [puppet] - https://gerrit.wikimedia.org/r/914406 (owner: JHathaway)
[10:17:37] <wikibugs>	 (CR) Ayounsi: [C: +2] Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[10:18:08] <wikibugs>	 (Merged) jenkins-bot: Add style checker and auto-formater to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/914701 (owner: Ayounsi)
[10:18:20] <wikibugs>	 (CR) Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene)
[10:18:24] <wikibugs>	 (CR) Jbond: [C: +1] puppet7: re-add host core [puppet] - https://gerrit.wikimedia.org/r/914408 (owner: JHathaway)
[10:18:44] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab2003.wikimedia.org
[10:18:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:914297|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[10:18:50] <stashbot>	 T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962
[10:18:51] <stashbot>	 T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460
[10:18:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[10:18:57] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[10:19:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47332 and previous config saved to /var/cache/conftool/dbconfig/20230503-101926-ladsgroup.json
[10:19:27] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org
[10:19:55] <wikibugs>	 (PS2) Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414)
[10:20:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47333 and previous config saved to /var/cache/conftool/dbconfig/20230503-102018-root.json
[10:20:52] <wikibugs>	 (CR) CI reject: [V: -1] Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[10:21:13] <wikibugs>	 (PS1) Elukey: Add conftool and service config for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T330414)
[10:21:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[10:22:06] <Lucas_WMDE>	 hm, it’s not really working as I would expect
[10:22:46] <wikibugs>	 (PS1) Volans: Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/914736
[10:23:05] <icinga-wm>	 RECOVERY - SSH on gitlab2003 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:23:30] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:23:32] <wikibugs>	 (PS2) Volans: Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/914736
[10:24:32] <wikibugs>	 (PS3) Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T330414)
[10:24:43] <wikibugs>	 (CR) Ayounsi: [C: +1] Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/914736 (owner: Volans)
[10:24:46] <Lucas_WMDE>	 wait, I think I’ve been testing the wrong API, nevermind
[10:25:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:25:53] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org
[10:25:59] <Lucas_WMDE>	 yup, works as expected when I use list=wblistentityusage instead of prop=wbentityusage
[10:26:04] <Lucas_WMDE>	 syncing
[10:26:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T335838)', diff saved to https://phabricator.wikimedia.org/P47334 and previous config saved to /var/cache/conftool/dbconfig/20230503-102654-ladsgroup.json
[10:27:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:27:01] <wikibugs>	 (CR) Volans: [V: +2 C: +2] Upstream release v3.2.9 with WMF modifications (2) [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/914736 (owner: Volans)
[10:27:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:27:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T335838)', diff saved to https://phabricator.wikimedia.org/P47335 and previous config saved to /var/cache/conftool/dbconfig/20230503-102719-ladsgroup.json
[10:27:33] <wikibugs>	 (PS2) Elukey: conftool-data: add config for the k8s ingress for ml-staging [puppet] - https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756)
[10:27:35] <wikibugs>	 (PS2) Elukey: Add conftool and service config for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756)
[10:27:41] <wikibugs>	 (PS4) Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756)
[10:28:21] <wikibugs>	 (PS1) Hnowlan: admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T335681)
[10:28:24] <wikibugs>	 (CR) Btullis: Jupyterhub-conda exclude /mnt from accessible paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene)
[10:28:29] <wikibugs>	 (Restored) Btullis: Jupyterhub-conda exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene)
[10:29:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:41] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:29:49] <wikibugs>	 (PS2) Hnowlan: admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488)
[10:30:19] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:32:26] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1004.eqiad.wmnet
[10:32:54] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:33:43] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
[10:33:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T335838)', diff saved to https://phabricator.wikimedia.org/P47336 and previous config saved to /var/cache/conftool/dbconfig/20230503-103345-ladsgroup.json
[10:33:57] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubestagemaster2001.codfw.wmnet
[10:34:11] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
[10:34:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:34:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P47337 and previous config saved to /var/cache/conftool/dbconfig/20230503-103433-ladsgroup.json
[10:35:23] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[10:35:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47338 and previous config saved to /var/cache/conftool/dbconfig/20230503-103523-root.json
[10:35:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914297|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] (duration: 34m 53s)
[10:35:30] <stashbot>	 T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962
[10:35:31] <stashbot>	 T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460
[10:35:34] * Lucas_WMDE done
[10:35:40] <Lucas_WMDE>	 godog: you’re good to go as far as I’m concerned
[10:36:42] <godog>	 Lucas_WMDE: thank you! appreciate it
[10:38:30] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet
[10:38:58] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1004.eqiad.wmnet
[10:39:01] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2004.codfw.wmnet
[10:39:19] <wikibugs>	 (CR) Clément Goubert: [C: +2] recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: Clément Goubert)
[10:40:15] <wikibugs>	 (Merged) jenkins-bot: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062) (owner: Clément Goubert)
[10:40:26] <claime>	 !log Migrating recommendation-api staging to mw-api-int-async - T334062
[10:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:29] <stashbot>	 T334062: Migrate recommendation-api to mw-api-int - https://phabricator.wikimedia.org/T334062
[10:40:32] <claime>	 jouncebot: I'm loving that speedy CI
[10:40:38] <claime>	 _joe_*
[10:41:39] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[10:45:26] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2004.codfw.wmnet
[10:45:34] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1005.eqiad.wmnet
[10:45:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[10:45:57] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubestagemaster2001.codfw.wmnet
[10:46:05] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:46:25] <claime>	 Checking
[10:47:41] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:49] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[10:48:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P47339 and previous config saved to /var/cache/conftool/dbconfig/20230503-104851-ladsgroup.json
[10:49:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply
[10:49:31] <wikibugs>	 (PS5) Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - https://gerrit.wikimedia.org/r/914308
[10:49:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T335838)', diff saved to https://phabricator.wikimedia.org/P47340 and previous config saved to /var/cache/conftool/dbconfig/20230503-104939-ladsgroup.json
[10:49:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[10:49:56] <wikibugs>	 (CR) CI reject: [V: -1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - https://gerrit.wikimedia.org/r/914308 (owner: Muehlenhoff)
[10:49:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[10:50:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T335838)', diff saved to https://phabricator.wikimedia.org/P47341 and previous config saved to /var/cache/conftool/dbconfig/20230503-105004-ladsgroup.json
[10:50:10] <claime>	 !log Migrating recommendation-api codfw to mw-api-int-async - T334062
[10:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:13] <stashbot>	 T334062: Migrate recommendation-api to mw-api-int - https://phabricator.wikimedia.org/T334062
[10:50:15] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply
[10:50:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47342 and previous config saved to /var/cache/conftool/dbconfig/20230503-105028-root.json
[10:51:32] <claime>	 !log Migrating recommendation-api eqiad to mw-api-int-async - T334062
[10:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply
[10:52:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply
[10:53:43] <wikibugs>	 (PS6) Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - https://gerrit.wikimedia.org/r/914308
[10:54:08] <wikibugs>	 (CR) CI reject: [V: -1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - https://gerrit.wikimedia.org/r/914308 (owner: Muehlenhoff)
[10:55:30] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1110 to m3 [puppet] - https://gerrit.wikimedia.org/r/914727 (https://phabricator.wikimedia.org/T335092) (owner: Marostegui)
[10:56:04] <wikibugs>	 (PS15) ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[10:56:34] <wikibugs>	 (PS1) Hashar: tox: do not skip missing interpreters on CI [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914746
[10:56:36] <wikibugs>	 (PS1) Hashar: tox: use default python for local testing [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914747
[10:56:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T335838)', diff saved to https://phabricator.wikimedia.org/P47343 and previous config saved to /var/cache/conftool/dbconfig/20230503-105639-ladsgroup.json
[10:57:35] <wikibugs>	 (CR) Hashar: "I came up with that pattern in Quibble https://gerrit.wikimedia.org/r/c/integration/quibble/+/607512  so I can just `tox` locally and do n" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914747 (owner: Hashar)
[10:57:39] <wikibugs>	 (PS1) EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - https://gerrit.wikimedia.org/r/914748
[10:57:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet
[11:00:03] <jelto>	 GitLab needs a short maintenance break in one hour (12:00 UTC). For around 15 minutes GitLab, GitLab CI and most probably Phabricator will not be available
[11:00:30] <claime>	 ack
[11:00:46] <wikibugs>	 (CR) CI reject: [V: -1] [gitlab/failover] Add rollback method [cookbooks] - https://gerrit.wikimedia.org/r/914748 (owner: EoghanGaffney)
[11:00:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy[2001-2004].codfw.wmnet with reason: Reboot T335845
[11:01:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy[2001-2004].codfw.wmnet with reason: Reboot T335845
[11:01:10] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759)
[11:02:04] <wikibugs>	 (CR) CI reject: [V: -1] templates: add 20.172.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: Arturo Borrero Gonzalez)
[11:02:52] <marostegui>	 !log Reboot dbproxy200[1-4]
[11:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:07] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458)
[11:03:45] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759)
[11:03:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P47344 and previous config saved to /var/cache/conftool/dbconfig/20230503-110357-ladsgroup.json
[11:04:33] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[11:04:44] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[11:05:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47345 and previous config saved to /var/cache/conftool/dbconfig/20230503-110532-root.json
[11:05:35] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "diffConfig looks as expected to me – testwikidatawiki true, wikidatawiki false \o/" [mediawiki-config] - https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: Lucas Werkmeister (WMDE))
[11:06:01] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (ERayfield) my bad, sorry! didn't know that the sig had not gone through - thanks @JKieserman !
[11:06:31] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet
[11:07:09] <wikibugs>	 (03PS9) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[11:07:11] <wikibugs>	 (03PS1) 10Ayounsi: Fix multiple pylint inconsistencies [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914753
[11:07:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:07:50] <wikibugs>	 (03PS1) 10Hashar: Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754
[11:08:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[11:08:30] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "-1: some minor issues see inline" [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[11:08:50] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade
[11:09:56] <wikibugs>	 (03CR) 10Ayounsi: "Messages Found: 252 with c4d44ff08b50edc6508894a0e444e4088474b335" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[11:10:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Checkout tested patch in a branch [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar)
[11:10:32] <wikibugs>	 (03CR) 10Ayounsi: "down to ~250." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/914753 (owner: 10Ayounsi)
[11:10:47] <wikibugs>	 (03PS1) 10Volans: python_deploy: set the setgid bit on the git clone [puppet] - 10https://gerrit.wikimedia.org/r/914755
[11:11:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47346 and previous config saved to /var/cache/conftool/dbconfig/20230503-111145-ladsgroup.json
[11:11:49] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw
[11:11:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1014.eqiad.wmnet with reason: Upgrade
[11:11:55] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] python_deploy: set the setgid bit on the git clone [puppet] - 10https://gerrit.wikimedia.org/r/914755 (owner: 10Volans)
[11:11:57] <wikibugs>	 (03PS1) 10Majavah: build: format scripts/ with black too [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756
[11:11:59] <wikibugs>	 (03PS1) 10Majavah: webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757
[11:12:03] <wikibugs>	 (03PS1) 10Majavah: debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758
[11:12:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1014.eqiad.wmnet with reason: Upgrade
[11:12:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1015.eqiad.wmnet with reason: Upgrade
[11:12:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1015.eqiad.wmnet with reason: Upgrade
[11:12:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah)
[11:13:14] <wikibugs>	 (03PS2) 10Majavah: webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757
[11:13:16] <wikibugs>	 (03PS2) 10Majavah: debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758
[11:13:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1016.eqiad.wmnet with reason: Upgrade
[11:13:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1016.eqiad.wmnet with reason: Upgrade
[11:13:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy1017.eqiad.wmnet with reason: Upgrade
[11:13:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1017.eqiad.wmnet with reason: Upgrade
[11:14:05] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:15:53] <wikibugs>	 (03PS1) 10Majavah: webservice: Improve --buildservice-image help message [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759
[11:17:25] <wikibugs>	 (03CR) 10Volans: [C: 03+2] python_deploy: set the setgid bit on the git clone [puppet] - 10https://gerrit.wikimedia.org/r/914755 (owner: 10Volans)
[11:18:16] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:19:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Decom db1111 T335836', diff saved to https://phabricator.wikimedia.org/P47347 and previous config saved to /var/cache/conftool/dbconfig/20230503-111904-ladsgroup.json
[11:19:07] <stashbot>	 T335836: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836
[11:19:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T335838)', diff saved to https://phabricator.wikimedia.org/P47348 and previous config saved to /var/cache/conftool/dbconfig/20230503-111910-ladsgroup.json
[11:19:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[11:19:23] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[11:19:31] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:20:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Repooling after migrating', diff saved to https://phabricator.wikimedia.org/P47349 and previous config saved to /var/cache/conftool/dbconfig/20230503-112037-root.json
[11:20:48] <wikibugs>	 (03PS1) 10Ladsgroup: conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836)
[11:21:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup)
[11:22:31] <wikibugs>	 (03PS2) 10Ladsgroup: conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836)
[11:22:36] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] conftool-data: Remove db1111 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/914760 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup)
[11:23:07] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:23:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[11:24:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[11:24:33] <wikibugs>	 (03PS7) 10Jbond: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[11:25:39] <volans>	 elukey: I think the uncommitted dns changes are yours... k8s-ingress-ml-staging
[11:25:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[11:26:39] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:26:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[11:26:44] <volans>	 also the .83 IP was not reserved on the eqiad prefix elukey
[11:26:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[11:27:17] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:27:36] <wikibugs>	 (03CR) 10Jbond: jupyterhub-conda: Fix incompatibility with HDFS-FUSE mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene)
[11:28:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Remove db1111 from dbctl T335836', diff saved to https://phabricator.wikimedia.org/P47350 and previous config saved to /var/cache/conftool/dbconfig/20230503-112812-ladsgroup.json
[11:28:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene)
[11:28:16] <stashbot>	 T335836: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836
[11:28:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47351 and previous config saved to /var/cache/conftool/dbconfig/20230503-112819-ladsgroup.json
[11:28:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T335838)', diff saved to https://phabricator.wikimedia.org/P47352 and previous config saved to /var/cache/conftool/dbconfig/20230503-112819-ladsgroup.json
[11:28:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P47353 and previous config saved to /var/cache/conftool/dbconfig/20230503-112828-ladsgroup.json
[11:28:49] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:31:37] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:31:41] <wikibugs>	 (03CR) 10Jbond: "pcc looks strange:" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[11:33:25] <wikibugs>	 (03PS8) 10Muehlenhoff: apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308
[11:34:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T335838)', diff saved to https://phabricator.wikimedia.org/P47354 and previous config saved to /var/cache/conftool/dbconfig/20230503-113441-ladsgroup.json
[11:35:20] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Remove puppet entries for db1111 [puppet] - 10https://gerrit.wikimedia.org/r/914762 (https://phabricator.wikimedia.org/T335836)
[11:35:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47355 and previous config saved to /var/cache/conftool/dbconfig/20230503-113524-ladsgroup.json
[11:38:42] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Remove puppet entries for db1111 [puppet] - 10https://gerrit.wikimedia.org/r/914762 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup)
[11:38:46] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2015.codfw.wmnet
[11:38:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2015.codfw.wmnet
[11:38:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914746 (owner: 10Hashar)
[11:39:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914747 (owner: 10Hashar)
[11:40:04] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:40:41] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:42:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[11:43:13] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[11:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T335838)', diff saved to https://phabricator.wikimedia.org/P47356 and previous config saved to /var/cache/conftool/dbconfig/20230503-114335-ladsgroup.json
[11:43:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance
[11:44:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance
[11:44:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[11:44:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[11:44:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T335838)', diff saved to https://phabricator.wikimedia.org/P47357 and previous config saved to /var/cache/conftool/dbconfig/20230503-114426-ladsgroup.json
[11:44:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[11:44:36] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] mariadb: Remove puppet entries for db1111 [puppet] - 10https://gerrit.wikimedia.org/r/914762 (https://phabricator.wikimedia.org/T335836) (owner: 10Ladsgroup)
[11:44:41] <wikibugs>	 (03CR) 10Jbond: "lgtm just needs the style fixing e.g.  ./utils/check-style.sh" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/914754 (owner: 10Hashar)
[11:47:23] <wikibugs>	 (03PS5) 10Slyngshede: Requisition approval functionality. [software/bitu] - 10https://gerrit.wikimedia.org/r/911249
[11:49:31] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/914765 (https://phabricator.wikimedia.org/T335011)
[11:49:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P47358 and previous config saved to /var/cache/conftool/dbconfig/20230503-114947-ladsgroup.json
[11:49:59] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:50:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/914765 (https://phabricator.wikimedia.org/T335011) (owner: 10Marostegui)
[11:50:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P47359 and previous config saved to /var/cache/conftool/dbconfig/20230503-115030-ladsgroup.json
[11:50:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:51:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1110 from dbctl T335011', diff saved to https://phabricator.wikimedia.org/P47360 and previous config saved to /var/cache/conftool/dbconfig/20230503-115124-marostegui.json
[11:51:27] <stashbot>	 T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011
[11:51:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T335838)', diff saved to https://phabricator.wikimedia.org/P47361 and previous config saved to /var/cache/conftool/dbconfig/20230503-115130-ladsgroup.json
[11:51:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1111.eqiad.wmnet
[11:55:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:56:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[11:56:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[12:02:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1111.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[12:04:51] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: Use 2023-05-03-104124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914767 (https://phabricator.wikimedia.org/T331505)
[12:04:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P47362 and previous config saved to /var/cache/conftool/dbconfig/20230503-120453-ladsgroup.json
[12:04:56] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760)
[12:05:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Use 2023-05-03-104124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914767 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[12:05:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P47363 and previous config saved to /var/cache/conftool/dbconfig/20230503-120536-ladsgroup.json
[12:06:19] <wikibugs>	 (03Merged) 10jenkins-bot: machinetranslation: Use 2023-05-03-104124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914767 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[12:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1111.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[12:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:06:34] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade
[12:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1111.eqiad.wmnet
[12:06:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47364 and previous config saved to /var/cache/conftool/dbconfig/20230503-120637-ladsgroup.json
[12:06:52] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3321 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[12:07:05] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760)
[12:07:24] <jinxer-wm>	 (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:07:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:08:02] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 145034 bytes in 1.933 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[12:08:28] <godog>	 jouncebot: now and next
[12:08:28] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:08:28] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 51 minute(s)
[12:09:04] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet
[12:09:11] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[12:10:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] ipmi: remove check_ipmi_sensor, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/914726 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[12:10:43] <taavi>	 phabricator seems to be down again due to gitlab being down
[12:11:30] <Amir1>	 !log Removing db1111 from zarcillo T335836
[12:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:34] <stashbot>	 T335836: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836
[12:11:58] <jelto>	 taavi: yes that's expected, see my message from 11:00 here or in gitlab/releng channel
[12:12:10] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:12:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:12:42] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:12:54] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3321 bytes in 7.224 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[12:13:56] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:14:02] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 144129 bytes in 2.203 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[12:14:42] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:14:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] apt::repository: Don't assume all repository keys are ASCII-armored [puppet] - 10https://gerrit.wikimedia.org/r/914308 (owner: 10Muehlenhoff)
[12:15:17] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet
[12:16:35] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[12:17:19] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "I think one of the reasons for / being a readonly path for jupyter is to prevent some kind of writable access if a malicious actor was abl" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene)
[12:17:24] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:19:31] <wikibugs>	 (03CR) 10JMeybohm: "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914275 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[12:20:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T335838)', diff saved to https://phabricator.wikimedia.org/P47365 and previous config saved to /var/cache/conftool/dbconfig/20230503-122000-ladsgroup.json
[12:20:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:20:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:20:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:20:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:20:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47366 and previous config saved to /var/cache/conftool/dbconfig/20230503-122040-ladsgroup.json
[12:20:41] <wikibugs>	 (03PS1) 10Ottomata: eventgate - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914768 (https://phabricator.wikimedia.org/T331401)
[12:20:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47367 and previous config saved to /var/cache/conftool/dbconfig/20230503-122049-ladsgroup.json
[12:20:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[12:21:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove unused labstore code [puppet] - 10https://gerrit.wikimedia.org/r/914415 (owner: 10Andrew Bogott)
[12:21:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[12:21:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47368 and previous config saved to /var/cache/conftool/dbconfig/20230503-122113-ladsgroup.json
[12:21:21] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914768 (https://phabricator.wikimedia.org/T331401) (owner: 10Ottomata)
[12:21:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P47369 and previous config saved to /var/cache/conftool/dbconfig/20230503-122143-ladsgroup.json
[12:22:45] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] modules: add ml-staging cfg to the istio template [deployment-charts] - 10https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[12:22:49] <Amir1>	 https://phabricator.wikimedia.org/T335836 I can't open this, but I can open the rest of the phabricator tickets
[12:22:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] modules: duplicate the istio ingress template for 1.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914306 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[12:23:06] <Amir1>	 now I can open it
[12:23:08] <Amir1>	 anyway
[12:23:12] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491)
[12:23:24] <marostegui>	 Amir1: yes, there's maintenance on gitlab per jelto's comment on -sre
[12:23:52] <Amir1>	 thanks
[12:24:06] <jelto>	 phabricator should recover but it seems there is some caching and it needs some time
[12:24:09] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[12:24:24] <jelto>	 at least all of my tasks are loading again now
[12:24:25] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[12:25:07] <wikibugs>	 (03PS1) 10Ottomata: eventgate-main - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914769 (https://phabricator.wikimedia.org/T331401)
[12:25:29] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[12:26:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add conftool and service config for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[12:26:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[12:27:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47370 and previous config saved to /var/cache/conftool/dbconfig/20230503-122705-ladsgroup.json
[12:27:50] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[12:28:04] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-main - bump image version to pick up new schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914769 (https://phabricator.wikimedia.org/T331401) (owner: 10Ottomata)
[12:28:11] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 (10Ladsgroup) Awesome. Thanks!
[12:28:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[12:30:13] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756 (owner: 10Majavah)
[12:30:33] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah)
[12:31:05] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 (owner: 10Majavah)
[12:31:17] <godog>	 jouncebot: next
[12:31:17] <jouncebot>	 In 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1300)
[12:31:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47371 and previous config saved to /var/cache/conftool/dbconfig/20230503-123137-ladsgroup.json
[12:31:38] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 (owner: 10Majavah)
[12:31:39] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
[12:32:09] <wikibugs>	 (03PS3) 10Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495)
[12:32:13] <wikibugs>	 (03CR) 10Muehlenhoff: Use signed-by to in apt::package_from_component on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[12:32:56] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:33:47] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] build: format scripts/ with black too [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756 (owner: 10Majavah)
[12:33:52] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah)
[12:33:56] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 (owner: 10Majavah)
[12:33:58] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] webservice: Improve --buildservice-image help message [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 (owner: 10Majavah)
[12:34:12] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:34:32] <wikibugs>	 (03Merged) 10jenkins-bot: build: format scripts/ with black too [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914756 (owner: 10Majavah)
[12:34:35] <wikibugs>	 (03Merged) 10jenkins-bot: webservice: set argparse help program correctly in toolforge-cli [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914757 (owner: 10Majavah)
[12:34:44] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:35:24] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[12:35:41] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[12:35:42] <wikibugs>	 (03Merged) 10jenkins-bot: debian: provision toolforge-webservice symlink [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914758 (owner: 10Majavah)
[12:35:46] <wikibugs>	 (03Merged) 10jenkins-bot: webservice: Improve --buildservice-image help message [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/914759 (owner: 10Majavah)
[12:36:28] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[12:36:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T335838)', diff saved to https://phabricator.wikimedia.org/P47372 and previous config saved to /var/cache/conftool/dbconfig/20230503-123649-ladsgroup.json
[12:36:54] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:36:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[12:37:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[12:37:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47373 and previous config saved to /var/cache/conftool/dbconfig/20230503-123714-ladsgroup.json
[12:37:34] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:37:54] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[12:38:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47374 and previous config saved to /var/cache/conftool/dbconfig/20230503-123837-ladsgroup.json
[12:42:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P47375 and previous config saved to /var/cache/conftool/dbconfig/20230503-124212-ladsgroup.json
[12:44:14] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] "k8s node restarts are happening in codfw now so I have to wait a bit to deploy this..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/914768 (https://phabricator.wikimedia.org/T331401) (owner: 10Ottomata)
[12:45:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47376 and previous config saved to /var/cache/conftool/dbconfig/20230503-124558-ladsgroup.json
[12:46:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P47377 and previous config saved to /var/cache/conftool/dbconfig/20230503-124643-ladsgroup.json
[12:47:19] <wikibugs>	 (03PS1) 10Andrew Bogott: toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771
[12:47:31] <wikibugs>	 (03PS6) 10Slyngshede: Requisition approval functionality. [software/bitu] - 10https://gerrit.wikimedia.org/r/911249
[12:47:46] <wikibugs>	 (03CR) 10Kamila Součková: "LGTM, but I don't have enough context to actually feel okay +1'ing this '^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[12:48:34] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760)
[12:48:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 (owner: 10Andrew Bogott)
[12:49:02] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 (owner: 10Andrew Bogott)
[12:49:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] toolschecker: update list of expected etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/914771 (owner: 10Andrew Bogott)
[12:50:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[12:53:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove router access for cmjohnson [homer/public] - 10https://gerrit.wikimedia.org/r/914260 (owner: 10Muehlenhoff)
[12:55:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-dns-floating-ip-updater: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914412 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[12:55:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[12:57:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P47378 and previous config saved to /var/cache/conftool/dbconfig/20230503-125718-ladsgroup.json
[12:58:14] <wikibugs>	 (03PS2) 10Jbond: get_config: add specific get_config script for puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/912949
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1300).
[13:00:05] <jouncebot>	 MichaelG_WMDE, subbu, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <Lucas_WMDE>	 o/
[13:00:16] <subbu>	 o/
[13:00:18] <Lucas_WMDE>	 I can deploy!
[13:00:34] <MichaelG_WMDE>	 hi
[13:01:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P47379 and previous config saved to /var/cache/conftool/dbconfig/20230503-130105-ladsgroup.json
[13:01:17] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): testwikidatawiki: enable entity labels in parsed API edit summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: 10Michael Große)
[13:01:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: 10Michael Große)
[13:01:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P47380 and previous config saved to /var/cache/conftool/dbconfig/20230503-130149-ladsgroup.json
[13:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: testwikidatawiki: enable entity labels in parsed API edit summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912309 (https://phabricator.wikimedia.org/T335098) (owner: 10Michael Große)
[13:02:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:912309|testwikidatawiki: enable entity labels in parsed API edit summaries (T335098)]]
[13:02:59] <stashbot>	 T335098: Testwikidata: enable entity labels in parsed edit summaries in API requests - https://phabricator.wikimedia.org/T335098
[13:04:53] <wikibugs>	 (03PS1) 10Jaime Nuche: beta: delete old files regularly from Puppet client bucket on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/914777
[13:05:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:912309|testwikidatawiki: enable entity labels in parsed API edit summaries (T335098)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:05:01] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s start the gate-and-submit already" [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) (owner: 10Lucas Werkmeister (WMDE))
[13:05:59] <Lucas_WMDE>	 hm, I don’t see a difference on https://test.wikidata.org/w/api.php?action=query&format=json&list=recentchanges&formatversion=2&rcnamespace=0&rcprop=parsedcomment
[13:06:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] get_config: add specific get_config script for puppet7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912949 (owner: 10Jbond)
[13:06:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd.py: remove some dead code [puppet] - 10https://gerrit.wikimedia.org/r/913962 (owner: 10Andrew Bogott)
[13:07:17] <Lucas_WMDE>	 ah, but on https://test.wikidata.org/w/api.php?action=query&format=json&prop=revisions&revids=636981&formatversion=2&rvprop=comment|parsedcomment it works
[13:07:33] <Lucas_WMDE>	 does list=recentchanges not work the same way? o_O
[13:07:37] <Lucas_WMDE>	 but good to deploy for now, I think
[13:07:39] <Lucas_WMDE>	 syncing
[13:07:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:08:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Use signed-by to in apt::package_from_component on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913132 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[13:08:55] <MichaelG_WMDE>	 can confirm that it works on the debug server
[13:09:14] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[13:09:25] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[13:09:36] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[13:09:53] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo)
[13:10:48] <wikibugs>	 (03PS3) 10Hashar: Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357)
[13:11:07] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[13:12:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47381 and previous config saved to /var/cache/conftool/dbconfig/20230503-131224-ladsgroup.json
[13:12:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:12:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:12:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47382 and previous config saved to /var/cache/conftool/dbconfig/20230503-131249-ladsgroup.json
[13:12:58] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - bump image to v1.15.0-dev0 to debug OOM [deployment-charts] - 10https://gerrit.wikimedia.org/r/914780 (https://phabricator.wikimedia.org/T332948)
[13:13:28] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[13:13:36] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] "Holding for discussion on whether merging staging-test and staging is a good idea." [deployment-charts] - 10https://gerrit.wikimedia.org/r/914275 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[13:13:43] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - bump image to v1.15.0-dev0 to debug OOM [deployment-charts] - 10https://gerrit.wikimedia.org/r/914780 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata)
[13:14:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T335838)', diff saved to https://phabricator.wikimedia.org/P47383 and previous config saved to /var/cache/conftool/dbconfig/20230503-131414-ladsgroup.json
[13:16:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P47384 and previous config saved to /var/cache/conftool/dbconfig/20230503-131611-ladsgroup.json
[13:16:17] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[13:16:27] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[13:16:52] <Lucas_WMDE>	 “Finished Running helmfile -e codfw --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 07m 26s)” o_O
[13:16:54] <Lucas_WMDE>	 7½ minutes…
[13:16:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47385 and previous config saved to /var/cache/conftool/dbconfig/20230503-131656-ladsgroup.json
[13:17:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[13:17:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[13:17:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:17:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:17:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47386 and previous config saved to /var/cache/conftool/dbconfig/20230503-131736-ladsgroup.json
[13:18:14] <claime>	 Lucas_WMDE: I'm rebooting kubernetes nodes in codfw, that's probably why
[13:18:16] <wikibugs>	 (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/914731/1780/" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[13:18:20] <Lucas_WMDE>	 ah ok
[13:18:48] <wikibugs>	 (03CR) 10Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[13:18:51] <Lucas_WMDE>	 it went faster with the non-canary apply at least
[13:18:54] <wikibugs>	 (03PS5) 10Muehlenhoff: Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495)
[13:19:04] <Lucas_WMDE>	 e.g. “Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 26s)”
[13:19:14] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[13:19:52] <claime>	 The scheduler probably sent a canary pod to a node that didn't have the image yet
[13:20:11] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[13:20:13] <Lucas_WMDE>	 oh right, and then it takes a while to download
[13:20:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47387 and previous config saved to /var/cache/conftool/dbconfig/20230503-132022-ladsgroup.json
[13:20:25] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] "looks reasonable, but probably depends on Ib07b2acdf, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[13:20:40] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:20:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:912309|testwikidatawiki: enable entity labels in parsed API edit summaries (T335098)]] (duration: 17m 55s)
[13:20:54] <stashbot>	 T335098: Testwikidata: enable entity labels in parsed edit summaries in API requests - https://phabricator.wikimedia.org/T335098
[13:20:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) (owner: 10Lucas Werkmeister (WMDE))
[13:21:14] <claime>	 Yeah, the mediawiki image is a tad on the heavy side
[13:21:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:21:45] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[13:22:54] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:23:03] <claime>	 XioNoX: Can that BGP alert be because of the reboots?
[13:23:10] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458)
[13:23:13] <claime>	 (and in that case should I downtime it for the duration)
[13:23:18] <wikibugs>	 (03CR) 10Herron: [C: 03+1] prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - 10https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse)
[13:23:21] <claime>	 s/can/should/
[13:23:29] <XioNoX>	 claime: yep, looks like it
[13:23:42] <XioNoX>	 (in meeting)
[13:23:46] <claime>	 ack
[13:23:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47388 and previous config saved to /var/cache/conftool/dbconfig/20230503-132349-ladsgroup.json
[13:23:57] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[13:23:59] <wikibugs>	 (03Merged) 10jenkins-bot: wblistentityusage: Deprecate wbeu prefix, new output format [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914437 (https://phabricator.wikimedia.org/T300460) (owner: 10Lucas Werkmeister (WMDE))
[13:24:12] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[13:24:15] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[13:24:20] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:24:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914437|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]]
[13:24:35] <stashbot>	 T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962
[13:24:35] <stashbot>	 T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460
[13:24:45] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Should be testable here: https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[13:26:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:26:37] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet
[13:26:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-webproxy: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914416 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[13:27:33] <wikibugs>	 (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/914777/41013/" [puppet] - 10https://gerrit.wikimedia.org/r/914777 (owner: 10Jaime Nuche)
[13:30:00] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-wikireplica-dns: use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914417 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[13:30:02] <Lucas_WMDE>	 build-and-push-container-images is taking its time again
[13:30:34] <Lucas_WMDE>	 6 patches per deployment window feels pretty optimistic these days
[13:31:05] <Lucas_WMDE>	 (maybe it’ll get faster again once we only deploy to k8s? fingers crossed)
[13:31:08] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:31:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47389 and previous config saved to /var/cache/conftool/dbconfig/20230503-133117-ladsgroup.json
[13:32:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47390 and previous config saved to /var/cache/conftool/dbconfig/20230503-133232-ladsgroup.json
[13:33:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-enc-cli:  use os-cloud section rather than envfile [puppet] - 10https://gerrit.wikimedia.org/r/914418 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[13:33:52] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
[13:34:05] <logmsgbot>	 !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM idm-test1001.wikimedia.org
[13:34:27] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
[13:34:45] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[13:35:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P47391 and previous config saved to /var/cache/conftool/dbconfig/20230503-133528-ladsgroup.json
[13:35:46] <logmsgbot>	 !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kafkamon1003.eqiad.wmnet
[13:35:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs notify_maintainers: use mwopenstackclients for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/914419 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[13:36:03] <logmsgbot>	 !log slyngshede@cumin1001 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM idm-test1001.wikimedia.org
[13:36:11] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org
[13:37:04] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service,burrow-logging-eqiad.service,burrow-main-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:48] <icinga-wm>	 RECOVERY - Host db2184 is UP: PING OK - Packet loss = 0%, RTA = 35.61 ms
[13:38:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P47392 and previous config saved to /var/cache/conftool/dbconfig/20230503-133855-ladsgroup.json
[13:39:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Use signed-by to include the Wikimedia repo starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[13:39:11] <wikibugs>	 SRE-tools, Infrastructure-Foundations: cookbooks.sre.ganeti.reimage: failure reported when first puppet run succeeds after a retry - https://phabricator.wikimedia.org/T335863 (herron)
[13:39:15] <Lucas_WMDE>	 subbu: fyi I’m planning to skip ahead to your config change once the current scap is done, so you don’t have to wait through the rest of the Wikidata changes
[13:39:24] <Lucas_WMDE>	 (I assume we’ll overrun the window)
[13:39:40] <Lucas_WMDE>	 oh dear, no, there’s LVS maintenance right after it /o\
[13:39:58] * Lucas_WMDE is used to the luxury of several free hours after the UTC afternoon backport window :D
[13:40:04] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org
[13:40:13] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org
[13:40:15] <subbu>	 sounds good!
[13:40:15] <Lucas_WMDE>	 then the rest of the Wikidata changes might just have to wait a few hours longer
[13:40:37] <wikibugs>	 (CR) Klausman: [C: +1] Add the VIP settings for the K8s ingress for ml-staging [dns] - https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[13:41:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:914437|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:42:04] <stashbot>	 T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962
[13:42:04] <stashbot>	 T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460
[13:42:10] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:42:10] <Lucas_WMDE>	 testing
[13:42:59] <Lucas_WMDE>	 https://www.wikidata.org/w/api.php?action=query&format=json&list=wblistentityusage&formatversion=2&wbeuentities=Q1 / https://www.wikidata.org/w/api.php?action=query&format=json&list=wblistentityusage&formatversion=2&wbleuentities=Q1 looks good on mwdebug, syncing
[13:43:21] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org
[13:43:24] <wikibugs>	 (CR) Klausman: [C: +1] Add conftool and service config for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[13:43:38] <wikibugs>	 (CR) Klausman: [C: +1] conftool-data: add config for the k8s ingress for ml-staging [puppet] - https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[13:43:58] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host kafkamon2003.codfw.wmnet
[13:43:59] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.netbox
[13:44:40] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:45:13] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (JMeybohm) a:RLazarus→JMeybohm
[13:45:59] <wikibugs>	 SRE, envoy, serviceops: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (JMeybohm) a:RLazarus→None
[13:46:14] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:46:45] <MichaelG_WMDE>	 brb
[13:46:53] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kafkamon2003.codfw.wmnet - herron@cumin1001"
[13:47:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47393 and previous config saved to /var/cache/conftool/dbconfig/20230503-134740-ladsgroup.json
[13:47:54] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kafkamon2003.codfw.wmnet - herron@cumin1001"
[13:47:54] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:47:54] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache kafkamon2003.codfw.wmnet on all recursors
[13:47:57] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafkamon2003.codfw.wmnet on all recursors
[13:48:58] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-codfw
[13:49:16] <sukhe>	 jouncebot: nowandnext
[13:49:16] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1300)
[13:49:16] <jouncebot>	 In 0 hour(s) and 10 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400)
[13:50:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P47394 and previous config saved to /var/cache/conftool/dbconfig/20230503-135034-ladsgroup.json
[13:51:08] <icinga-wm>	 PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:13] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-spreadcheck: use clouds.yaml section rather than envfile [puppet] - https://gerrit.wikimedia.org/r/914463 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[13:51:27] <wikibugs>	 (PS1) Herron: kafkamon: add kafkamon[12]003 to fw allow list [puppet] - https://gerrit.wikimedia.org/r/914787 (https://phabricator.wikimedia.org/T335424)
[13:51:48] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs-exportd: convert to using mwopenstackclients and --os-cloud [puppet] - https://gerrit.wikimedia.org/r/914464 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[13:52:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914437|wblistentityusage: Deprecate wbeu prefix, new output format (T300460 T196962)]] (duration: 27m 54s)
[13:52:29] <stashbot>	 T196962: Module prefix 'wbeu' is shared between Wikibase\Client\Api\ApiListEntityUsage and Wikibase\Client\Api\ApiPropsEntityUsage - https://phabricator.wikimedia.org/T196962
[13:52:30] <stashbot>	 T300460: [API] WikibaseClient: wblistentityusage API module adds its results to the `query.pages` key in response - https://phabricator.wikimedia.org/T300460
[13:52:39] <wikibugs>	 (PS5) Lucas Werkmeister (WMDE): Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: C. Scott Ananian)
[13:52:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: C. Scott Ananian)
[13:52:51] <Lucas_WMDE>	 yippee
[13:53:14] <wikibugs>	 (PS2) Hashar: Checkout tested patch in a branch [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914754
[13:53:27] <wikibugs>	 (Merged) jenkins-bot: Turn on experimental Parsoid Read Views support, except on commons & wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/910556 (https://phabricator.wikimedia.org/T335157) (owner: C. Scott Ananian)
[13:53:31] <wikibugs>	 (CR) Hashar: Checkout tested patch in a branch (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914754 (owner: Hashar)
[13:53:42] <Lucas_WMDE>	 let’s hope it finishes before the window ends…
[13:53:53] <MichaelG_WMDE>	 re
[13:53:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:910556|Turn on experimental Parsoid Read Views support, except on commons & wikidata (T335157)]]
[13:54:00] <stashbot>	 T335157: Experimentally enable Parsoid Read Views pages on query string - https://phabricator.wikimedia.org/T335157
[13:54:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P47395 and previous config saved to /var/cache/conftool/dbconfig/20230503-135402-ladsgroup.json
[13:54:21] <sukhe>	 I can wait for the next one, in fact I have to, so all good :)
[13:55:01] <Lucas_WMDE>	 ok phew ^^
[13:55:14] <Lucas_WMDE>	 then I’ll probably do some more config changes after this one if that’s okay
[13:55:25] <Lucas_WMDE>	 *and backports actually
[13:55:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and cscott: Backport for [[gerrit:910556|Turn on experimental Parsoid Read Views support, except on commons & wikidata (T335157)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:55:39] <Lucas_WMDE>	 subbu: can you test the change?
[13:55:55] <sukhe>	 Lucas_WMDE: please ping me when you are done, thanks
[13:55:56] <subbu>	 yes, i can .. is it on the servers?
[13:56:06] <Lucas_WMDE>	 should be on the mwdebug servers
[13:56:09] <Lucas_WMDE>	 sukhe: will do
[13:56:09] <subbu>	 ok.
[13:56:53] <wikibugs>	 (CR) Andrea Denisse: [C: +2] prometheus: Synchronize only the /srv/prometheus directory when migrating data [puppet] - https://gerrit.wikimedia.org/r/914400 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[13:58:19] * Lucas_WMDE tries to remember what the next backport would be anyways
[13:58:38] <Lucas_WMDE>	 backport wbsubscribers fix to wmf branches, then do the config change for it on Test Wikidata, right MichaelG_WMDE?
[13:59:01] <MichaelG_WMDE>	 yes, I think so
[13:59:07] <wikibugs>	 (CR) Ayounsi: templates: add 20.172.in-addr.arpa (1 comment) [dns] - https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: Arturo Borrero Gonzalez)
[13:59:12] <Lucas_WMDE>	 ok thanks
[13:59:30] <subbu>	 Lucas_WMDE, this is on eqiad right, not codfw?
[13:59:43] <Lucas_WMDE>	 `scap backport` says it synced to both
[13:59:49] <Lucas_WMDE>	 mwdebug1001, 1002, 2001, 2002
[13:59:50] <subbu>	 ok ..
[13:59:58] <wikibugs>	 (PS1) Stevemunene: Add analytics_product admin group for airflow [puppet] - https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000)
[14:00:05] <jouncebot>	 sukhe: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for LVS maintenance . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400).
[14:00:51] <wikibugs>	 (Restored) Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: Michael Große)
[14:00:55] <wikibugs>	 (CR) Ayounsi: cloudlb: fix BGP IP address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: Arturo Borrero Gonzalez)
[14:01:09] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: Michael Große)
[14:01:31] <Lucas_WMDE>	 I’ll +2 the backport already, it’ll take a while anyways
[14:01:44] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "starting gate-and-submit already" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: Michael Große)
[14:02:22] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kafkamon2003.codfw.wmnet - herron@cumin1001"
[14:02:26] <MichaelG_WMDE>	 is it alright to move forward with the backport despite the LVS maintenance?
[14:02:27] <subbu>	 hmm .. it doesn't seem to be having any effect at all.
[14:02:39] <wikibugs>	 ops-codfw, DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (Jhancock.wm) @Marostegui I can do this today. I tried earlier to reboot the idrac the unobtrusive way, holding the i button until the fans spin up, but it hasn't worked.  The next step is to drain the flea power so we will nee...
[14:02:39] * MichaelG_WMDE has no idea what "LVS maintenance" actually is
[14:02:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P47396 and previous config saved to /var/cache/conftool/dbconfig/20230503-140246-ladsgroup.json
[14:02:57] <subbu>	 Lucas_WMDE, you can sync everywhere, and we can debug after to see what is going on.
[14:03:19] <subbu>	 maybe i am missing something here that Scott probably knows.
[14:03:26] <subbu>	 right now, it looks like a no-op.
[14:03:27] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kafkamon2003.codfw.wmnet - herron@cumin1001"
[14:03:27] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kafkamon2003.codfw.wmnet
[14:03:49] <Lucas_WMDE>	 subbu: ok thanks
[14:04:00] <Lucas_WMDE>	 yeah I didn’t see anything either but it’s not like I knew a lot about what to look for ^^
[14:04:08] <wikibugs>	 ops-codfw, DBA: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (Marostegui) @Jhancock.wm I will switchover this host tomorrow from its current master role - so it will be ready for you to power it down whenever you need. I will write here once it is all fine for you to power it off.
[14:04:11] <sukhe>	 MichaelG_WMDE: we are decommissioning and provisioning an LVS server and can't do it when deploys are happening: T334703
[14:04:11] <stashbot>	 T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[14:04:13] <Lucas_WMDE>	 (I just picked a random page via the API and loaded it with the URL parameter from the commit message)
[14:04:18] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[14:04:31] <subbu>	 it works just fine on my local mediawiki install ..
[14:04:47] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[14:05:01] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder)
[14:05:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47398 and previous config saved to /var/cache/conftool/dbconfig/20230503-140540-ladsgroup.json
[14:07:10] <Lucas_WMDE>	 sukhe: just to confirm, I have some more time for deploying, as long as I ping you at the end, correct?
[14:07:22] <Lucas_WMDE>	 or did you want to do the LVS thing now after all and I misunderstood?
[14:07:37] <Lucas_WMDE>	 (I’d like to get my backports out of the way but they can wait if needed)
[14:07:37] <sukhe>	 Lucas_WMDE: we have to do the LVS thing now because dc-ops will be on site soon :)
[14:07:43] <Lucas_WMDE>	 hm
[14:07:49] <Lucas_WMDE>	 then I misunderstood your message earlier, sorry
[14:08:01] <sukhe>	 no, that's on me too. I meant that you can finish the existing and last one safely 
[14:08:03] <Lucas_WMDE>	 then I’ll ping you as soon as this scap is done and pause there
[14:08:07] <Lucas_WMDE>	 ok, I see
[14:08:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Andrew) @papaul, note that these hosts are still pending some trial work in codfw1dev so you shouldn't spend any effort on these ho...
[14:08:44] <Lucas_WMDE>	 ok now I understand, “I have to” only meant “have to wait until the scap is done because otherwise all hell breaks loose” :'D
[14:09:00] <sukhe>	 haha
[14:09:02] <sukhe>	 sadly
[14:09:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T335838)', diff saved to https://phabricator.wikimedia.org/P47399 and previous config saved to /var/cache/conftool/dbconfig/20230503-140908-ladsgroup.json
[14:09:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[14:09:23] <sukhe>	 we have plans to fix this, and we should, but yeah, that's more long-term than provisioning the hosts
[14:09:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:910556|Turn on experimental Parsoid Read Views support, except on commons & wikidata (T335157)]] (duration: 15m 27s)
[14:09:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[14:09:28] <stashbot>	 T335157: Experimentally enable Parsoid Read Views pages on query string - https://phabricator.wikimedia.org/T335157
[14:09:31] <Lucas_WMDE>	 sukhe: I’m done for now, go ahead
[14:09:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47400 and previous config saved to /var/cache/conftool/dbconfig/20230503-140932-ladsgroup.json
[14:09:36] <sukhe>	 Lucas_WMDE: thank you!
[14:09:57] <wikibugs>	 (PS1) David Caro: d/changelog: prepare for release 0.95 [software/tools-webservice] - https://gerrit.wikimedia.org/r/914790
[14:09:58] <sukhe>	 !log stop pybal on lvs2007 to drain host for decommissioning
[14:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:00] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:10:29] <Lucas_WMDE>	 oh but that means I should retract my +2 because that backport won’t merge so quickly now after all
[14:10:39] <wikibugs>	 (PS4) MdsShakil: Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829)
[14:10:44] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "nope, this needs to wait until after the LVS window" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: Michael Große)
[14:11:12] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[14:11:25] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[14:11:35] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:11:38] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[14:11:46] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[14:11:56] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[14:12:08] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[14:12:23] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:12:28] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:12:50] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:12:56] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:13:08] <icinga-wm>	 PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:13:15] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[14:13:20] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[14:13:30] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[14:13:31] <wikibugs>	 (PS1) Jbond: apt: drop files from the puppet source [puppet] - https://gerrit.wikimedia.org/r/914791
[14:13:33] <wikibugs>	 (CR) MdsShakil: Create autopatroller and patroller groups on bn.wikiquote (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[14:13:58] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[14:14:01] <subbu>	 Lucas_WMDE, let me know once the config change is everywhere. thanks!
[14:14:06] <wikibugs>	 (PS5) MdsShakil: Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829)
[14:14:13] <Lucas_WMDE>	 subbu: it should be everywhere by now
[14:14:26] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[14:14:26] <subbu>	 ty
[14:14:26] <Lucas_WMDE>	 (I’m done deploying now)
[14:14:40] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41014/console" [puppet] - https://gerrit.wikimedia.org/r/914791 (owner: Jbond)
[14:14:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T335838)', diff saved to https://phabricator.wikimedia.org/P47401 and previous config saved to /var/cache/conftool/dbconfig/20230503-141458-ladsgroup.json
[14:14:59] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[14:15:02] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:15:11] <sukhe>	 ^ expected
[14:15:16] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[14:15:26] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] apt: drop files from the puppet source [puppet] - https://gerrit.wikimedia.org/r/914791 (owner: Jbond)
[14:15:37] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[14:15:43] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[14:15:49] <wikibugs>	 (CR) Superpes15: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[14:16:01] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[14:16:43] <wikibugs>	 (CR) Superpes15: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[14:16:45] <logmsgbot>	 !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767
[14:16:48] <stashbot>	 T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767
[14:17:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47402 and previous config saved to /var/cache/conftool/dbconfig/20230503-141752-ladsgroup.json
[14:17:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[14:18:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[14:18:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T335838)', diff saved to https://phabricator.wikimedia.org/P47403 and previous config saved to /var/cache/conftool/dbconfig/20230503-141817-ladsgroup.json
[14:24:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T335838)', diff saved to https://phabricator.wikimedia.org/P47404 and previous config saved to /var/cache/conftool/dbconfig/20230503-142427-ladsgroup.json
[14:25:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:26:07] <wikibugs>	 (PS2) EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - https://gerrit.wikimedia.org/r/914748
[14:26:54] <ottomata>	 !log Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in kafka main clusters - T334733
[14:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:57] <stashbot>	 T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733
[14:28:39] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert "sre.hosts.reimage/sre.ganeti.reimage: Delete Puppet state file before reimage" [cookbooks] - https://gerrit.wikimedia.org/r/912311 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[14:29:13] <wikibugs>	 SRE, Data-Engineering, SRE Observability, Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (Ottomata) Done for Kafka main.  We should do this for Kafka logging as well, so that when...
[14:29:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[14:30:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P47405 and previous config saved to /var/cache/conftool/dbconfig/20230503-143005-ladsgroup.json
[14:31:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (Aklapper) Hi @lojo, welcome to Wikimedia Phabricator! This Phabricator account does not use a `@wikimedia.de` email address, and currently there is no WMDE mediawiki.org SUL account [associated](htt...
[14:33:17] <sukhe>	 !log set routing-options static route 208.80.153.224/28 next-hop 10.192.49.7 [move static route for high-traffic1 to lvs2010]: T335777
[14:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:20] <stashbot>	 T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777
[14:33:24] <wikibugs>	 (CR) Elukey: [C: +2] modules: duplicate the istio ingress template for 1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/914306 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[14:33:34] <wikibugs>	 (PS5) Elukey: modules: add ml-staging cfg to the istio template [deployment-charts] - https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756)
[14:34:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/914317/41016/" [puppet] - https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: Arturo Borrero Gonzalez)
[14:34:33] <wikibugs>	 (Abandoned) Elukey: ml-services: limit deployments of experimental to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900381 (owner: Elukey)
[14:34:41] <wikibugs>	 (CR) Elukey: [C: +2] modules: add ml-staging cfg to the istio template [deployment-charts] - https://gerrit.wikimedia.org/r/914307 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[14:36:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs2007.codfw.wmnet
[14:36:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:37:48] <wikibugs>	 (PS1) Elukey: fast-api: update ingress.istio module version to 1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/914793 (https://phabricator.wikimedia.org/T335756)
[14:37:50] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] "The issue with the deploy server was due to me trying https://gerrit.wikimedia.org/r/c/operations/puppet/+/906051 where /srv/mediawiki alr" [puppet] - https://gerrit.wikimedia.org/r/914777 (owner: Jaime Nuche)
[14:37:54] <wikibugs>	 (PS1) Majavah: kubernetes: Remove deprecated state from buildservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/914794
[14:38:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  frav1003 - jclark@cumin1001"
[14:39:03] <wikibugs>	 (PS1) Elukey: ml-services: enable the 'mlstaging' ingress flag for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/914795 (https://phabricator.wikimedia.org/T335756)
[14:39:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47406 and previous config saved to /var/cache/conftool/dbconfig/20230503-143933-ladsgroup.json
[14:39:59] <sukhe>	 jouncebot: nowandnext
[14:39:59] <jouncebot>	 For the next 2 hour(s) and 20 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400)
[14:40:00] <jouncebot>	 In 2 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1700)
[14:40:06] <icinga-wm>	 RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:40:06] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org
[14:40:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  frav1003 - jclark@cumin1001"
[14:40:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:41:07] <wikibugs>	 (CR) Elukey: [C: +2] fast-api: update ingress.istio module version to 1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/914793 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[14:41:16] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: enable the 'mlstaging' ingress flag for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/914795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[14:42:26] <wikibugs>	 (PS1) Eevans: restbase: upgrade Cassandra on restbase2012 & restbase1016 [puppet] - https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383)
[14:42:59] <ottomata>	 !log Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in kafka logging clusters - T334733
[14:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:02] <stashbot>	 T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733
[14:43:23] <wikibugs>	 (CR) Clément Goubert: [C: +2] beta: delete old Puppet client bucket files from deployment server [puppet] - https://gerrit.wikimedia.org/r/914777 (owner: Jaime Nuche)
[14:43:55] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org
[14:45:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[14:45:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P47407 and previous config saved to /var/cache/conftool/dbconfig/20230503-144511-ladsgroup.json
[14:46:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:46:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs2007.codfw.wmnet
[14:46:24] <wikibugs>	 ops-codfw, Traffic, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2007.codfw.wmnet` - lvs2007.codfw.wmnet (**WARN**)   - Downtimed host on Ici...
[14:46:29] <wikibugs>	 SRE, Data-Engineering, SRE Observability, Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (Ottomata) Done for logging clusters, and we all done!
[14:48:39] <wikibugs>	 (CR) Eevans: [C: +2] restbase: upgrade Cassandra on restbase2012 & restbase1016 [puppet] - https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[14:49:29] <wikibugs>	 (CR) Ssingh: [C: +2] lvs2007: decommission host for codfw hardware refresh [puppet] - https://gerrit.wikimedia.org/r/914341 (https://phabricator.wikimedia.org/T335777) (owner: Ssingh)
[14:50:13] <wikibugs>	 (CR) Ssingh: [C: +2] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors [homer/public] - https://gerrit.wikimedia.org/r/914344 (https://phabricator.wikimedia.org/T335777) (owner: Ssingh)
[14:50:26] <wikibugs>	 (CR) Hnowlan: restbase: upgrade Cassandra on restbase2012 & restbase1016 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[14:51:34] <wikibugs>	 ops-codfw, Traffic, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (ssingh)
[14:52:55] <sukhe>	 !log homer "cr*-codfw*" commit "Gerrit: 914344 remove decommissioned host lvs2007": T335777
[14:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:58] <stashbot>	 T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777
[14:53:10] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[14:53:13] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[14:54:08] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [software/tools-webservice] - https://gerrit.wikimedia.org/r/914794 (owner: Majavah)
[14:54:12] <sukhe>	 !log [finished] homer "cr*-codfw*" commit "Gerrit: 914344 remove decommissioned host lvs2007": T335777
[14:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:18] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:54:33] <wikibugs>	 (CR) David Caro: [C: +2] kubernetes: Remove deprecated state from buildservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/914794 (owner: Majavah)
[14:54:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P47408 and previous config saved to /var/cache/conftool/dbconfig/20230503-145440-ladsgroup.json
[14:55:18] <wikibugs>	 (Merged) jenkins-bot: kubernetes: Remove deprecated state from buildservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/914794 (owner: Majavah)
[14:57:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:57:48] <wikibugs>	 (CR) Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa (1 comment) [dns] - https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: Arturo Borrero Gonzalez)
[14:57:50] <wikibugs>	 (PS2) David Caro: d/changelog: prepare for release 0.95 [software/tools-webservice] - https://gerrit.wikimedia.org/r/914790
[14:58:12] <wikibugs>	 (PS1) Hokwelum: Increase number of retries for html download [puppet] - https://gerrit.wikimedia.org/r/914800
[14:59:04] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 (wiki_willy) a:wiki_willy→Jclark-ctr
[14:59:36] <sukhe>	 !log fix backup route for high-traffic2 in codfw: set routing-options static route 208.80.153.240/28 next-hop 10.192.17.7
[14:59:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:49] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (Jclark-ctr) a:Jclark-ctr
[14:59:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloud_private_subnet: support using FQDNs instead of hardcoded IP addresses [puppet] - https://gerrit.wikimedia.org/r/914317 (https://phabricator.wikimedia.org/T335760) (owner: Arturo Borrero Gonzalez)
[15:00:09] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335775 (Jclark-ctr) a:Jclark-ctr
[15:00:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T335838)', diff saved to https://phabricator.wikimedia.org/P47409 and previous config saved to /var/cache/conftool/dbconfig/20230503-150017-ladsgroup.json
[15:00:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:00:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:00:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47410 and previous config saved to /var/cache/conftool/dbconfig/20230503-150042-ladsgroup.json
[15:01:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47411 and previous config saved to /var/cache/conftool/dbconfig/20230503-150103-ladsgroup.json
[15:01:13] <wikibugs>	 (PS2) Hokwelum: Increase number of retries for html download [puppet] - https://gerrit.wikimedia.org/r/914800 (https://phabricator.wikimedia.org/T335761)
[15:02:48] <icinga-wm>	 PROBLEM - Check systemd state on db2184 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:24] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:03:27] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[15:03:44] <wikibugs>	 (PS3) Majavah: d/changelog: prepare for release 0.95 [software/tools-webservice] - https://gerrit.wikimedia.org/r/914790 (owner: David Caro)
[15:03:54] <wikibugs>	 (CR) Majavah: [C: +1] d/changelog: prepare for release 0.95 [software/tools-webservice] - https://gerrit.wikimedia.org/r/914790 (owner: David Caro)
[15:07:00] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:07:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47412 and previous config saved to /var/cache/conftool/dbconfig/20230503-150702-ladsgroup.json
[15:08:10] <wikibugs>	 (CR) David Caro: [C: +2] d/changelog: prepare for release 0.95 [software/tools-webservice] - https://gerrit.wikimedia.org/r/914790 (owner: David Caro)
[15:09:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47413 and previous config saved to /var/cache/conftool/dbconfig/20230503-150947-ladsgroup.json
[15:09:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T335838)', diff saved to https://phabricator.wikimedia.org/P47414 and previous config saved to /var/cache/conftool/dbconfig/20230503-150947-ladsgroup.json
[15:09:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:09:59] <wikibugs>	 (Merged) jenkins-bot: d/changelog: prepare for release 0.95 [software/tools-webservice] - https://gerrit.wikimedia.org/r/914790 (owner: David Caro)
[15:10:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:10:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T335838)', diff saved to https://phabricator.wikimedia.org/P47415 and previous config saved to /var/cache/conftool/dbconfig/20230503-151013-ladsgroup.json
[15:16:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T335838)', diff saved to https://phabricator.wikimedia.org/P47416 and previous config saved to /var/cache/conftool/dbconfig/20230503-151627-ladsgroup.json
[15:17:15] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:17:18] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[15:17:18] <wikibugs>	 (CR) Jbond: [C: +2] tox: do not skip missing interpreters on CI [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914746 (owner: Hashar)
[15:17:20] <wikibugs>	 (CR) Jbond: [C: +2] tox: use default python for local testing [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914747 (owner: Hashar)
[15:17:23] <wikibugs>	 (CR) Jbond: [C: +2] Checkout tested patch in a branch [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914754 (owner: Hashar)
[15:18:37] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops-radar, Patch-For-Review: Deal with archival of Stretch on Debian mirrors - https://phabricator.wikimedia.org/T335282 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is resolved: * apt sources on remaining stretch servers stopped u...
[15:19:34] <wikibugs>	 (Merged) jenkins-bot: tox: do not skip missing interpreters on CI [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914746 (owner: Hashar)
[15:19:36] <wikibugs>	 (Merged) jenkins-bot: tox: use default python for local testing [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914747 (owner: Hashar)
[15:19:39] <wikibugs>	 (Merged) jenkins-bot: Checkout tested patch in a branch [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914754 (owner: Hashar)
[15:21:34] <wikibugs>	 (CR) Muehlenhoff: "FYI, new access groups need discussion/approval in the weekly SRE Infrastructure Foundations meeting (happening next Monday) so this will " [puppet] - https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene)
[15:22:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47417 and previous config saved to /var/cache/conftool/dbconfig/20230503-152208-ladsgroup.json
[15:22:15] <wikibugs>	 (PS1) Elukey: fastapi: bump chart's version [deployment-charts] - https://gerrit.wikimedia.org/r/914849
[15:23:56] <wikibugs>	 (PS1) Eevans: restbase: upgrade cluster to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914851 (https://phabricator.wikimedia.org/T335383)
[15:24:23] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:24:26] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[15:24:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P47418 and previous config saved to /var/cache/conftool/dbconfig/20230503-152453-ladsgroup.json
[15:25:23] <wikibugs>	 (CR) Elukey: [C: +2] fastapi: bump chart's version [deployment-charts] - https://gerrit.wikimedia.org/r/914849 (owner: Elukey)
[15:28:09] <wikibugs>	 (PS1) Jbond: 2.5.6: prepare release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914853
[15:28:46] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] 2.5.6: prepare release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/914853 (owner: Jbond)
[15:29:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:30:16] <wikibugs>	 (PS1) Jbond: puppet_compiler: bump version [puppet] - https://gerrit.wikimedia.org/r/914854
[15:30:25] <wikibugs>	 SRE-tools, Infrastructure-Foundations: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (MoritzMuehlenhoff)
[15:31:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47419 and previous config saved to /var/cache/conftool/dbconfig/20230503-153133-ladsgroup.json
[15:31:53] <wikibugs>	 (CR) JHathaway: puppet: use a string rather than a symbol to call a puppet function (1 comment) [puppet] - https://gerrit.wikimedia.org/r/914406 (owner: JHathaway)
[15:32:04] <wikibugs>	 (CR) JHathaway: [C: +2] puppet: use a string rather than a symbol to call a puppet function [puppet] - https://gerrit.wikimedia.org/r/914406 (owner: JHathaway)
[15:32:28] <wikibugs>	 (CR) JHathaway: [C: +2] puppet7: re-add host core [puppet] - https://gerrit.wikimedia.org/r/914408 (owner: JHathaway)
[15:32:53] <wikibugs>	 (PS1) Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767)
[15:32:54] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:32:57] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: bump version [puppet] - https://gerrit.wikimedia.org/r/914854 (owner: Jbond)
[15:33:30] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
[15:34:36] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:34:39] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[15:34:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:35:18] <wikibugs>	 (PS2) Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767)
[15:36:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for lvs2011 - pt1979@cumin2002"
[15:37:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P47420 and previous config saved to /var/cache/conftool/dbconfig/20230503-153715-ladsgroup.json
[15:37:32] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster1002.eqiad.wmnet
[15:37:54] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts puppetmaster1002.eqiad.wmnet
[15:38:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for lvs2011 - pt1979@cumin2002"
[15:38:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
[15:38:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet
[15:39:48] <wikibugs>	 SRE, ops-codfw, DBA, Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (Jhancock.wm) @Papaul  Dell kicked back my updated dispatch request for not enough troubleshooting. Since the server was down, I swapped DIMM A6 with DIMM A5 about two hours ago and the server ha...
[15:40:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P47421 and previous config saved to /var/cache/conftool/dbconfig/20230503-154000-ladsgroup.json
[15:40:54] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[15:40:56] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[15:40:59] <wikibugs>	 (PS3) Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767)
[15:40:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs2011.mgmt.codfw.wmnet with reboot policy FORCED
[15:41:18] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[15:41:59] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760)
[15:42:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet
[15:42:51] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cloudlb: fix BGP IP address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: Arturo Borrero Gonzalez)
[15:42:53] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[15:42:59] <wikibugs>	 (PS1) Elukey: admin_ng: add ml-staging among helmfile_namespace_certs's options [deployment-charts] - https://gerrit.wikimedia.org/r/914859 (https://phabricator.wikimedia.org/T335756)
[15:43:01] <wikibugs>	 (CR) Eevans: [C: +2] restbase: upgrade cluster to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914851 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[15:43:05] <wikibugs>	 SRE, ops-codfw, DBA, Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (Marostegui) How can Dell kick back this request when their systems logs say: `Multi-bit memory errors are detected on the memory device at location(s) DIMM_A6. Immediately replace the DIMM.` - t...
[15:45:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release machinetranslation/staging on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:46:38] <wikibugs>	 (PS1) Jbond: sre.hardward.upgrade-firmware: Ensure we only apply version check to gen 14 [cookbooks] - https://gerrit.wikimedia.org/r/914860
[15:46:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P47422 and previous config saved to /var/cache/conftool/dbconfig/20230503-154639-ladsgroup.json
[15:48:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:48:47] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[15:51:06] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) (owner: Ssingh)
[15:52:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47423 and previous config saved to /var/cache/conftool/dbconfig/20230503-155221-ladsgroup.json
[15:55:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47424 and previous config saved to /var/cache/conftool/dbconfig/20230503-155506-ladsgroup.json
[15:55:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:55:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:56:00] <sukhe>	 jouncebot: nowandnext
[15:56:00] <jouncebot>	 For the next 1 hour(s) and 3 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1400)
[15:56:00] <jouncebot>	 In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1700)
[15:59:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:59:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:59:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T335838)', diff saved to https://phabricator.wikimedia.org/P47425 and previous config saved to /var/cache/conftool/dbconfig/20230503-155946-ladsgroup.json
[16:00:35] <wikibugs>	 SRE, ops-codfw, DBA, Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (jcrespo) 2 in a row, for hw errors captured in their own hw logs? T335396#8821456 Will we have to send our lawyers so they honor their contract obligations?
[16:00:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47426 and previous config saved to /var/cache/conftool/dbconfig/20230503-160039-ladsgroup.json
[16:00:46] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: add ml-staging among helmfile_namespace_certs's options [deployment-charts] - https://gerrit.wikimedia.org/r/914859 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[16:01:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T335838)', diff saved to https://phabricator.wikimedia.org/P47427 and previous config saved to /var/cache/conftool/dbconfig/20230503-160146-ladsgroup.json
[16:03:07] <wikibugs>	 (PS1) Jdlrobson: Enable graphs on test wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940)
[16:03:23] <wikibugs>	 (CR) Herron: [C: +2] kafkamon: add kafkamon[12]003 to fw allow list [puppet] - https://gerrit.wikimedia.org/r/914787 (https://phabricator.wikimedia.org/T335424) (owner: Herron)
[16:05:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[16:05:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[16:06:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T335838)', diff saved to https://phabricator.wikimedia.org/P47428 and previous config saved to /var/cache/conftool/dbconfig/20230503-160601-ladsgroup.json
[16:06:41] <wikibugs>	 (PS3) Hokwelum: Increase number of retries for html dumps download [puppet] - https://gerrit.wikimedia.org/r/914800 (https://phabricator.wikimedia.org/T335761)
[16:07:12] <hauskater>	 hola marostegui - got a minute for a quick DB question?
[16:07:42] <wikibugs>	 (Abandoned) Jbond: sre.hardward.upgrade-firmware: Ensure we only apply version check to gen 14 [cookbooks] - https://gerrit.wikimedia.org/r/914860 (owner: Jbond)
[16:08:07] <wikibugs>	 (CR) ArielGlenn: [C: +2] Increase number of retries for html dumps download [puppet] - https://gerrit.wikimedia.org/r/914800 (https://phabricator.wikimedia.org/T335761) (owner: Hokwelum)
[16:08:44] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster2001.codfw.wmnet
[16:08:57] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetmaster2001.codfw.wmnet
[16:11:32] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760)
[16:12:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T335838)', diff saved to https://phabricator.wikimedia.org/P47429 and previous config saved to /var/cache/conftool/dbconfig/20230503-161235-ladsgroup.json
[16:13:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2011.mgmt.codfw.wmnet with reboot policy FORCED
[16:14:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T335838)', diff saved to https://phabricator.wikimedia.org/P47430 and previous config saved to /var/cache/conftool/dbconfig/20230503-161402-ladsgroup.json
[16:14:24] <wikibugs>	 (PS1) Jbond: sre.hardward.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - https://gerrit.wikimedia.org/r/914866
[16:15:22] <icinga-wm>	 RECOVERY - Check systemd state on db2184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:35] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760)
[16:15:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P47431 and previous config saved to /var/cache/conftool/dbconfig/20230503-161545-ladsgroup.json
[16:17:26] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardward.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - https://gerrit.wikimedia.org/r/914866 (owner: Jbond)
[16:18:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011']
[16:18:38] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/output/914772/41019/" [puppet] - https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: Arturo Borrero Gonzalez)
[16:19:34] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2011']
[16:19:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011']
[16:19:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2011']
[16:20:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011']
[16:20:14] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2011']
[16:20:58] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760)
[16:23:40] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:24] <icinga-wm>	 RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011']
[16:27:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P47432 and previous config saved to /var/cache/conftool/dbconfig/20230503-162741-ladsgroup.json
[16:28:22] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[16:29:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P47433 and previous config saved to /var/cache/conftool/dbconfig/20230503-162908-ladsgroup.json
[16:30:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:30:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P47434 and previous config saved to /var/cache/conftool/dbconfig/20230503-163051-ladsgroup.json
[16:31:32] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:36] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2011']
[16:35:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:35:52] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:18] <icinga-wm>	 PROBLEM - dump of backup1-codfw in codfw on backupmon1001 is CRITICAL: Last dump for backup1-codfw at codfw (db2184) taken on 2023-05-03 16:20:01 is 17 GiB, but the previous one was 15 GiB, a change of +16.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:42:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P47435 and previous config saved to /var/cache/conftool/dbconfig/20230503-164248-ladsgroup.json
[16:43:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[16:43:47] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[16:44:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P47436 and previous config saved to /var/cache/conftool/dbconfig/20230503-164414-ladsgroup.json
[16:45:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T335838)', diff saved to https://phabricator.wikimedia.org/P47437 and previous config saved to /var/cache/conftool/dbconfig/20230503-164557-ladsgroup.json
[16:46:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:46:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:46:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T335838)', diff saved to https://phabricator.wikimedia.org/P47438 and previous config saved to /var/cache/conftool/dbconfig/20230503-164622-ladsgroup.json
[16:46:37] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[16:46:41] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[16:47:14] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[16:47:17] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[16:47:41] <wikibugs>	 (03PS1) 10Urbanecm: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914836 (https://phabricator.wikimedia.org/T334630)
[16:47:54] <wikibugs>	 (03PS1) 10Urbanecm: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630)
[16:52:37] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add new LVS host lvs2011 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/914871 (https://phabricator.wikimedia.org/T326767)
[16:57:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T335838)', diff saved to https://phabricator.wikimedia.org/P47440 and previous config saved to /var/cache/conftool/dbconfig/20230503-165754-ladsgroup.json
[16:58:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:58:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T335838)', diff saved to https://phabricator.wikimedia.org/P47441 and previous config saved to /var/cache/conftool/dbconfig/20230503-165811-ladsgroup.json
[16:58:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:58:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47442 and previous config saved to /var/cache/conftool/dbconfig/20230503-165818-ladsgroup.json
[16:58:44] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.reimage for host kafkamon2003.codfw.wmnet with OS bullseye
[16:58:51] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by herron@cumin1001 for host kafkamon2003.codfw.wmnet with OS bullseye
[16:59:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T335838)', diff saved to https://phabricator.wikimedia.org/P47443 and previous config saved to /var/cache/conftool/dbconfig/20230503-165920-ladsgroup.json
[16:59:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[16:59:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[16:59:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T335838)', diff saved to https://phabricator.wikimedia.org/P47444 and previous config saved to /var/cache/conftool/dbconfig/20230503-165954-ladsgroup.json
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1700)
[17:00:41] <sukhe>	 ^ please note that there is a scap lock in progress, as we are still provisioning the lvs host in codfw
[17:00:50] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:01:04] <sukhe>	 if there is any deployment for this slot, please let me know and I will lift it and stop the work (and not resume it)
[17:01:56] <wikibugs>	 (03PS2) 10Jdlrobson: Enable graphs on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940)
[17:02:14] <wikibugs>	 (03PS1) 10Ottomata: flink-app - quote all flinkConfiguration values [deployment-charts] - 10https://gerrit.wikimedia.org/r/914874
[17:02:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage
[17:03:09] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-app - quote all flinkConfiguration values [deployment-charts] - 10https://gerrit.wikimedia.org/r/914874 (owner: 10Ottomata)
[17:03:40] <wikibugs>	 (03CR) 10Majavah: Enable graphs on test wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson)
[17:05:44] <sukhe>	 lifting the lock as it's unlikely we will finish reimaging the next lvs host by then, including the "predictable interfaces" and all that :)
[17:05:47] <logmsgbot>	 !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 169m 01s)
[17:05:50] <stashbot>	 T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767
[17:05:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage
[17:06:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47445 and previous config saved to /var/cache/conftool/dbconfig/20230503-170607-ladsgroup.json
[17:07:19] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm for easier migration and switchovers." [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn)
[17:07:31] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[17:07:48] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[17:08:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T335838)', diff saved to https://phabricator.wikimedia.org/P47446 and previous config saved to /var/cache/conftool/dbconfig/20230503-170821-ladsgroup.json
[17:13:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add discovery records for miscweb in eqiad and miscweb in codfw [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797) (owner: 10Dzahn)
[17:13:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P47447 and previous config saved to /var/cache/conftool/dbconfig/20230503-171317-ladsgroup.json
[17:13:20] <wikibugs>	 (03PS3) 10Dzahn: add discovery records for miscweb in eqiad and miscweb in codfw [dns] - 10https://gerrit.wikimedia.org/r/914369 (https://phabricator.wikimedia.org/T335797)
[17:15:28] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafkamon2003.codfw.wmnet with reason: host reimage
[17:18:38] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafkamon2003.codfw.wmnet with reason: host reimage
[17:21:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[17:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P47448 and previous config saved to /var/cache/conftool/dbconfig/20230503-172114-ladsgroup.json
[17:22:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[17:22:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2011.codfw.wmnet with OS bullseye
[17:22:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye completed...
[17:23:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[17:23:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P47449 and previous config saved to /var/cache/conftool/dbconfig/20230503-172328-ladsgroup.json
[17:23:48] <wikibugs>	 (03PS16) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[17:28:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P47450 and previous config saved to /var/cache/conftool/dbconfig/20230503-172824-ladsgroup.json
[17:31:40] <wikibugs>	 (03PS17) 10ArielGlenn: [WIP] Support for testing a new dumps NFS share [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[17:32:34] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kafkamon2003.codfw.wmnet with OS bullseye
[17:32:39] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by herron@cumin1001 for host kafkamon2003.codfw.wmnet with OS bullseye completed: - kafkamon2003 (**PASS**)   - Remov...
[17:36:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P47451 and previous config saved to /var/cache/conftool/dbconfig/20230503-173620-ladsgroup.json
[17:38:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P47452 and previous config saved to /var/cache/conftool/dbconfig/20230503-173834-ladsgroup.json
[17:40:51] <wikibugs>	 (03PS1) 10Herron: kafkamon: cut over to bullseye exporters [puppet] - 10https://gerrit.wikimedia.org/r/914876 (https://phabricator.wikimedia.org/T335424)
[17:41:11] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:11] <inflatador>	 !log bking@cumin1001 reboot wdqs20[13-22].codfw.wmnet T335835
[17:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T335838)', diff saved to https://phabricator.wikimedia.org/P47453 and previous config saved to /var/cache/conftool/dbconfig/20230503-174330-ladsgroup.json
[17:46:26] <wikibugs>	 (03PS1) 10BCornwall: debian/rules: Add --buildsystem=pybuild [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877
[17:48:35] <wikibugs>	 (03PS1) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878
[17:49:27] <wikibugs>	 (03PS2) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878
[17:50:01] <wikibugs>	 (03PS3) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878
[17:50:43] <wikibugs>	 (03PS2) 10BCornwall: debian/rules: Add --buildsystem=pybuild [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877
[17:51:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877 (owner: 10BCornwall)
[17:51:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T335838)', diff saved to https://phabricator.wikimedia.org/P47454 and previous config saved to /var/cache/conftool/dbconfig/20230503-175126-ladsgroup.json
[17:51:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:51:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:52:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:52:26] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] debian/rules: Add --buildsystem=pybuild [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/914877 (owner: 10BCornwall)
[17:53:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:53:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T335838)', diff saved to https://phabricator.wikimedia.org/P47455 and previous config saved to /var/cache/conftool/dbconfig/20230503-175340-ladsgroup.json
[17:53:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:53:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:54:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T335838)', diff saved to https://phabricator.wikimedia.org/P47456 and previous config saved to /var/cache/conftool/dbconfig/20230503-175404-ladsgroup.json
[17:55:15] <wikibugs>	 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Papaul) @Jhancock.wm can  you run the netbox offline script and get lvs2007 out of the rack and into storage ? Thanks
[17:56:27] <wikibugs>	 (03CR) 10Jdlrobson: Enable graphs on test wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: 10Jdlrobson)
[17:57:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[17:58:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[17:58:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47457 and previous config saved to /var/cache/conftool/dbconfig/20230503-175806-ladsgroup.json
[18:00:06] <jouncebot>	 brennen and jeena: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1800).
[18:00:06] <jouncebot>	 brennen and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T1800).
[18:00:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T335838)', diff saved to https://phabricator.wikimedia.org/P47458 and previous config saved to /var/cache/conftool/dbconfig/20230503-180018-ladsgroup.json
[18:02:34] <brennen>	 o/
[18:02:47] <brennen>	 sukhe: safe to proceed w/train?
[18:04:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47459 and previous config saved to /var/cache/conftool/dbconfig/20230503-180438-ladsgroup.json
[18:05:18] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder)
[18:07:14] <wikibugs>	 (03PS4) 10Ssingh: lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767)
[18:07:19] <sukhe>	 brennen: yes please
[18:07:20] <sukhe>	 sorry, just saw
[18:07:35] <brennen>	 no worries!  we're not on a time crunch.
[18:08:20] <brennen>	 !log train 1.41.0-wmf.7 (T330213): logs quiet and no current blockers, rolling to group1
[18:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:23] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[18:08:50] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914879 (https://phabricator.wikimedia.org/T330213)
[18:08:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914879 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot)
[18:09:51] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914879 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot)
[18:11:50] <urbanecm>	 for what it's worth, i'm currently unable to access Wikimedia sites (connection times out).
[18:12:26] <RhinosF1>	 I can access enwiki urbanecm
[18:12:32] <dancy>	 Working for me.
[18:12:36] <RhinosF1>	 Have you tried a different network?
[18:12:37] <jeena>	 same
[18:12:44] <jeena>	 (working for me)
[18:12:44] <sukhe>	 no visible issues
[18:14:02] <urbanecm>	 okay, might be a wiki-specific issue in the WMCZ's office connection, appears to work via mobile data. sorry for the false alarm then!
[18:14:27] <sukhe>	 np! I am a bit on edge because we have an LVS host down in codfw. in theory it should not be a problem but if it is, then I am ready to depool codfw :)
[18:15:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P47460 and previous config saved to /var/cache/conftool/dbconfig/20230503-181524-ladsgroup.json
[18:15:47] <wikibugs>	 (03PS2) 10Majavah: hieradata: remove files for long-gone hosts [puppet] - 10https://gerrit.wikimedia.org/r/914268
[18:15:49] <wikibugs>	 (03PS2) 10Majavah: O:wmcs::nfs: delete old primary role files [puppet] - 10https://gerrit.wikimedia.org/r/914269
[18:15:51] <wikibugs>	 (03PS2) 10Majavah: P::ldap::client::labs: drop support for production [puppet] - 10https://gerrit.wikimedia.org/r/914270
[18:15:53] <wikibugs>	 (03PS2) 10Majavah: labstore: remove unused files [puppet] - 10https://gerrit.wikimedia.org/r/914272
[18:16:10] <wikibugs>	 (03Abandoned) 10Majavah: O:wmcs::nfs: delete old test role [puppet] - 10https://gerrit.wikimedia.org/r/914271 (owner: 10Majavah)
[18:16:38] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.7  refs T330213
[18:16:43] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[18:19:40] <urbanecm>	 ftr, tracert ends at 195.2.20.74 / ae44-xcr1.att.cw.net, which seems to be within Vodafone's network. 
[18:19:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P47461 and previous config saved to /var/cache/conftool/dbconfig/20230503-181944-ladsgroup.json
[18:20:03] <wikibugs>	 (03PS2) 10Eevans: Add component/cassandra41 for Cassandra 4.1.x releases [puppet] - 10https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814)
[18:20:34] <RhinosF1>	 sukhe: ^
[18:20:49] <RhinosF1>	 Maybe issues between Vodafone and wikimedia then
[18:20:59] <RhinosF1>	 urbanecm: to drmrs or esams?
[18:21:57] <urbanecm>	 esams. seems to work again now though. 
[18:22:15] <sukhe>	 ok great!
[18:22:22] * sukhe loves self-resolving issues
[18:22:24] <sukhe>	 :)
[18:22:26] <urbanecm>	 me too!
[18:22:57] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.7  refs T330213 (duration: 06m 18s)
[18:23:01] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[18:24:39] <RhinosF1>	 Networks confuse me
[18:24:59] <RhinosF1>	 Because they never break when connectivity ops are looking
[18:26:30] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[18:26:33] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[18:26:48] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[18:30:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P47462 and previous config saved to /var/cache/conftool/dbconfig/20230503-183030-ladsgroup.json
[18:33:20] <wikibugs>	 (03PS1) 10Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/914881
[18:34:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P47463 and previous config saved to /var/cache/conftool/dbconfig/20230503-183451-ladsgroup.json
[18:45:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T335838)', diff saved to https://phabricator.wikimedia.org/P47464 and previous config saved to /var/cache/conftool/dbconfig/20230503-184536-ladsgroup.json
[18:45:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:46:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:46:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T335838)', diff saved to https://phabricator.wikimedia.org/P47465 and previous config saved to /var/cache/conftool/dbconfig/20230503-184610-ladsgroup.json
[18:49:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47466 and previous config saved to /var/cache/conftool/dbconfig/20230503-184957-ladsgroup.json
[18:50:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[18:50:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[18:50:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:50:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:50:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T335838)', diff saved to https://phabricator.wikimedia.org/P47467 and previous config saved to /var/cache/conftool/dbconfig/20230503-185026-ladsgroup.json
[18:55:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T335838)', diff saved to https://phabricator.wikimedia.org/P47468 and previous config saved to /var/cache/conftool/dbconfig/20230503-185526-ladsgroup.json
[18:56:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T335838)', diff saved to https://phabricator.wikimedia.org/P47469 and previous config saved to /var/cache/conftool/dbconfig/20230503-185654-ladsgroup.json
[18:57:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[19:02:07] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] Add component/cassandra41 for Cassandra 4.1.x releases [puppet] - 10https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:09:07] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[19:10:18] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:10:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P47470 and previous config saved to /var/cache/conftool/dbconfig/20230503-191032-ladsgroup.json
[19:12:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P47471 and previous config saved to /var/cache/conftool/dbconfig/20230503-191200-ladsgroup.json
[19:19:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001
[19:20:27] <inflatador>	 !log bking@cumin1001 reboot Elastic cluster for T335835
[19:20:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:01] <wikibugs>	 (PS3) Jdlrobson: Enable graphs on test wikipedia and mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940)
[19:24:47] <wikibugs>	 (PS2) Jdlrobson: Explicitly enable MFCustomSiteModules [mediawiki-config] - https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603)
[19:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P47472 and previous config saved to /var/cache/conftool/dbconfig/20230503-192538-ladsgroup.json
[19:27:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P47473 and previous config saved to /var/cache/conftool/dbconfig/20230503-192707-ladsgroup.json
[19:29:55] <wikibugs>	 (CR) SBassett: [C: +1] "(from a security perspective)" [mediawiki-config] - https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: Jdlrobson)
[19:30:37] <icinga-wm>	 PROBLEM - Check systemd state on elastic2067 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:31:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic2068 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:15] <icinga-wm>	 PROBLEM - Check systemd state on elastic2085 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:34:03] <wikibugs>	 (CR) Btullis: [V: +1] jupyterhub-conda: Fix incompatibility with HDFS-FUSE mount (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene)
[19:34:05] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] jupyterhub-conda: Fix incompatibility with HDFS-FUSE mount [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene)
[19:35:15] <icinga-wm>	 RECOVERY - Check systemd state on elastic2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:47] <icinga-wm>	 RECOVERY - Check systemd state on elastic2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:53] <icinga-wm>	 RECOVERY - Check systemd state on elastic2085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:22] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001
[19:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T335838)', diff saved to https://phabricator.wikimedia.org/P47474 and previous config saved to /var/cache/conftool/dbconfig/20230503-194045-ladsgroup.json
[19:40:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:41:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:42:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T335838)', diff saved to https://phabricator.wikimedia.org/P47475 and previous config saved to /var/cache/conftool/dbconfig/20230503-194213-ladsgroup.json
[19:42:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:42:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:42:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T335838)', diff saved to https://phabricator.wikimedia.org/P47476 and previous config saved to /var/cache/conftool/dbconfig/20230503-194238-ladsgroup.json
[19:43:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T335835
[19:49:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T335838)', diff saved to https://phabricator.wikimedia.org/P47477 and previous config saved to /var/cache/conftool/dbconfig/20230503-194905-ladsgroup.json
[19:54:55] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:41] <icinga-wm>	 PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:55] <icinga-wm>	 PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:56:17] <icinga-wm>	 PROBLEM - Check systemd state on elastic2082 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:15] <icinga-wm>	 RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:29] <icinga-wm>	 RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230503T2000).
[20:00:05] <jouncebot>	 MdsShakil and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:22] <MdsShakil>	 Hello 🙋
[20:01:03] <icinga-wm>	 RECOVERY - Check systemd state on elastic2082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:27] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:37] <Jdlrobson>	 present
[20:04:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P47478 and previous config saved to /var/cache/conftool/dbconfig/20230503-200411-ladsgroup.json
[20:05:07] <wikibugs>	 (PS2) RLazarus: Render SLO and SLI numbers as percentunit [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032
[20:05:47] <icinga-wm>	 RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:07:03] <cjming>	 hi - i can deploy
[20:08:53] <cjming>	 MdsShakil: i'll start with yours
[20:09:19] <wikibugs>	 (CR) RLazarus: "Dashboard/slo-Linkrecommendation view: https://grafana.wikimedia.org/dashboard/snapshot/L337vYP1OAmYC0L2jWowYT6R58040t2I" [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:09:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[20:10:33] <wikibugs>	 (Merged) jenkins-bot: Create autopatroller and patroller groups on bn.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/914428 (https://phabricator.wikimedia.org/T335829) (owner: MdsShakil)
[20:11:02] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:914428|Create autopatroller and patroller groups on bn.wikiquote (T335829)]]
[20:11:06] <stashbot>	 T335829: Create autopatroller and patroller groups on bnwikiquote - https://phabricator.wikimedia.org/T335829
[20:11:58] <wikibugs>	 (CR) Clare Ming: [C: +2] Router handling code should be centralized into mmv.bootstrap [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914301 (https://phabricator.wikimedia.org/T236591) (owner: Jdlrobson)
[20:12:07] <wikibugs>	 (CR) RLazarus: Render SLO and SLI numbers as percentunit (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:12:59] <logmsgbot>	 !log cjming@deploy1002 cjming and mdsshakil: Backport for [[gerrit:914428|Create autopatroller and patroller groups on bn.wikiquote (T335829)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:13:06] <cjming>	 MdsShakil: can you test?
[20:13:42] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided)
[20:13:52] <wikibugs>	 (CR) Herron: [C: +1] "Nice one, simplifies the slo queries as well, sweet" [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:14:01] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 19s)
[20:14:02] <MdsShakil>	 cjming: looks good to me
[20:14:11] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[20:14:18] <cjming>	 great - syncing
[20:14:35] <wikibugs>	 (Merged) jenkins-bot: Router handling code should be centralized into mmv.bootstrap [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/914301 (https://phabricator.wikimedia.org/T236591) (owner: Jdlrobson)
[20:14:35] <icinga-wm>	 PROBLEM - Check systemd state on elastic2066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:39] <icinga-wm>	 PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:51] <wikibugs>	 (CR) Herron: "oop I spoke too soon, will wait for followup PS" [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:14:59] <wikibugs>	 (PS4) Clare Ming: Enable graphs on test wikipedia and mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: Jdlrobson)
[20:16:09] <icinga-wm>	 RECOVERY - Check systemd state on elastic2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:11] <icinga-wm>	 RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:17:47] <wikibugs>	 (CR) RLazarus: Render SLO and SLI numbers as percentunit (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:18:40] <wikibugs>	 (CR) Herron: [C: +1] Render SLO and SLI numbers as percentunit (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:19:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P47479 and previous config saved to /var/cache/conftool/dbconfig/20230503-201918-ladsgroup.json
[20:19:39] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:914428|Create autopatroller and patroller groups on bn.wikiquote (T335829)]] (duration: 08m 36s)
[20:19:41] <stashbot>	 T335829: Create autopatroller and patroller groups on bnwikiquote - https://phabricator.wikimedia.org/T335829
[20:19:41] <cjming>	 MdsShakil: should be live!
[20:19:57] <cjming>	 Jdlrobson: starting your patches now
[20:20:09] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: Jdlrobson)
[20:20:51] <MdsShakil>	 cjming: Thank you!
[20:20:58] <wikibugs>	 (Merged) jenkins-bot: Enable graphs on test wikipedia and mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/914863 (https://phabricator.wikimedia.org/T334940) (owner: Jdlrobson)
[20:21:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (8) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2046:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:48] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:914863|Enable graphs on test wikipedia and mediawiki.org (T334940)]]
[20:21:51] <stashbot>	 T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940
[20:22:20] <wikibugs>	 (PS3) Jdlrobson: Explicitly enable MFCustomSiteModules [mediawiki-config] - https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603)
[20:23:18] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:914863|Enable graphs on test wikipedia and mediawiki.org (T334940)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:23:20] <cjming>	 Jdlrobson: can you test your graphs patch?
[20:24:18] <wikibugs>	 (CR) RLazarus: [V: +2 C: +2] Render SLO and SLI numbers as percentunit (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/914032 (owner: RLazarus)
[20:24:29] <Jdlrobson>	 cjming: graphs is looking good to sync
[20:24:36] <cjming>	 fabu - syncing
[20:24:45] <Jdlrobson>	 This will increase client-side errors... I'm just not sure by how much :)
[20:30:08] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:914863|Enable graphs on test wikipedia and mediawiki.org (T334940)]] (duration: 08m 19s)
[20:30:11] <stashbot>	 T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940
[20:30:11] <cjming>	 Jdlrobson: graphs patch should be live - moving on to your 2nd one
[20:30:44] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) (owner: Jdlrobson)
[20:31:21] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:43] <wikibugs>	 (Merged) jenkins-bot: Explicitly enable MFCustomSiteModules [mediawiki-config] - https://gerrit.wikimedia.org/r/913241 (https://phabricator.wikimedia.org/T270603) (owner: Jdlrobson)
[20:32:10] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:913241|Explicitly enable MFCustomSiteModules (T270603)]]
[20:32:13] <stashbot>	 T270603: Module site.styles generates different output depending on mobile cookie, if $wgMFSiteStylesRenderBlocking = true; - https://phabricator.wikimedia.org/T270603
[20:33:17] <Jdlrobson>	 this one should be easy to check
[20:33:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[20:33:40] <logmsgbot>	 !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:913241|Explicitly enable MFCustomSiteModules (T270603)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:33:42] <cjming>	 Jdlrobson: wanna check your 2nd patch?
[20:34:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T335838)', diff saved to https://phabricator.wikimedia.org/P47480 and previous config saved to /var/cache/conftool/dbconfig/20230503-203424-ladsgroup.json
[20:36:44] <Jdlrobson>	 checking..
[20:37:01] <Jdlrobson>	 LGTM claime 
[20:37:04] <Jdlrobson>	 cjming: 
[20:37:09] <cjming>	 cool - syncing
[20:37:45] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:42:34] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:913241|Explicitly enable MFCustomSiteModules (T270603)]] (duration: 10m 23s)
[20:42:38] <stashbot>	 T270603: Module site.styles generates different output depending on mobile cookie, if $wgMFSiteStylesRenderBlocking = true; - https://phabricator.wikimedia.org/T270603
[20:43:15] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:914301|Router handling code should be centralized into mmv.bootstrap (T236591)]]
[20:43:18] <stashbot>	 T236591: Exiting an image displayed via mediaviewer on wikipedia takes you back one site in browser history instead of taking you to base article - https://phabricator.wikimedia.org/T236591
[20:43:42] <cjming>	 Jdlrobson: 2nd patch should be live - doing your backport now
[20:43:42] <rzl>	 (httpbb succeeded on a retry, so that error was unrelated to the deploy)
[20:43:45] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:44:46] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:914301|Router handling code should be centralized into mmv.bootstrap (T236591)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:45:05] <cjming>	 Jdlrobson: is backport testable?
[20:45:11] <Jdlrobson>	 yep!
[20:46:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:46:15] <icinga-wm>	 PROBLEM - Check systemd state on elastic2069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:19] <icinga-wm>	 PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:05] <icinga-wm>	 PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:21] <Jdlrobson>	 cjming: let me know when
[20:47:30] <cjming>	 Jdlrobson: oh - lmk if i should sync?
[20:47:37] <cjming>	 please test
[20:47:49] <icinga-wm>	 RECOVERY - Check systemd state on elastic2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:49] <Jdlrobson>	 yep lgtm
[20:47:56] <cjming>	 nice - going live
[20:48:17] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:50:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:53] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:59] <icinga-wm>	 RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:23] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:914301|Router handling code should be centralized into mmv.bootstrap (T236591)]] (duration: 10m 08s)
[20:53:27] <stashbot>	 T236591: Exiting an image displayed via mediaviewer on wikipedia takes you back one site in browser history instead of taking you to base article - https://phabricator.wikimedia.org/T236591
[20:53:41] <cjming>	 Jdlrobson: all live!
[20:53:51] <Jdlrobson>	 THANKS A BUNCH CLARE!
[20:53:56] <cjming>	 lol - yw!
[20:54:11] <cjming>	 !log end of UTC late backport window
[20:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:51] <icinga-wm>	 PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:06:33] <icinga-wm>	 PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:07:27] <icinga-wm>	 RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:11:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:15] <brett>	 !log Upgrading pybal to 1.15.11 on lvs4010
[21:11:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:21] <icinga-wm>	 RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:45] <icinga-wm>	 PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:57] <icinga-wm>	 PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:21] <icinga-wm>	 RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:31] <icinga-wm>	 RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:40] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@c53c095]: Refinery deploy [analytics/refinery@c53c095]
[21:31:03] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@c53c095]: Refinery deploy [analytics/refinery@c53c095] (duration: 08m 22s)
[21:31:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:13] <icinga-wm>	 PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:36] <brett>	 !log Uploaded pybal_1.15.11 to apt1001 via reprepro
[21:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:49] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[21:31:51] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[21:32:01] <icinga-wm>	 PROBLEM - Check systemd state on elastic2085 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:49] <icinga-wm>	 RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:11] <icinga-wm>	 RECOVERY - Check systemd state on elastic2085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:47] <wikibugs>	 (PS1) BCornwall: Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838
[21:36:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:39:59] <wikibugs>	 (PS2) BCornwall: Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838
[21:40:17] <icinga-wm>	 PROBLEM - Check systemd state on elastic2072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:40:38] <wikibugs>	 (CR) BBlack: [C: +1] Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838 (owner: BCornwall)
[21:41:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:20] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41025/console" [puppet] - https://gerrit.wikimedia.org/r/914838 (owner: BCornwall)
[21:41:51] <icinga-wm>	 RECOVERY - Check systemd state on elastic2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:10] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] Revert "Revert "pybal: Switch ulsfo LVS to use Maglev scheduler"" [puppet] - https://gerrit.wikimedia.org/r/914838 (owner: BCornwall)
[21:43:23] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@c53c095] (thin): Deploy THIN [analytics/refinery@c53c095]
[21:43:29] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@c53c095] (thin): Deploy THIN [analytics/refinery@c53c095] (duration: 00m 06s)
[21:46:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:05] <icinga-wm>	 PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:05] <wikibugs>	 (PS1) Eevans: aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383)
[21:49:18] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[21:52:45] <wikibugs>	 (PS2) Eevans: aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383)
[21:53:30] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[21:55:21] <brett>	 !log Disable puppet on lvs4008 for new pybal deployment (just in case immediate config rollback is required) - T263797
[21:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:25] <stashbot>	 T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797
[21:55:59] <icinga-wm>	 RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:56:59] <wikibugs>	 (CR) Eevans: [C: +2] aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[21:57:15] <icinga-wm>	 PROBLEM - Check systemd state on elastic2064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:58:51] <icinga-wm>	 RECOVERY - Check systemd state on elastic2064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:26] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[22:00:30] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[22:06:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:08:05] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[22:08:09] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[22:08:27] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/914897 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[22:10:02] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (phaultfinder)
[22:11:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:11:28] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[22:15:33] <icinga-wm>	 PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:17:09] <icinga-wm>	 RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:19:21] <wikibugs>	 (PS2) Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881
[22:19:40] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[22:19:44] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[22:19:46] <wikibugs>	 (CR) CI reject: [V: -1] trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881 (owner: Dzahn)
[22:21:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (11) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:21:26] <wikibugs>	 (CR) Dzahn: gerrit: move hieradata from role/common to common/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911920 (owner: Dzahn)
[22:24:25] <icinga-wm>	 PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:24:28] <wikibugs>	 (PS3) Dzahn: trafficserver: change name of the miscweb backend to new discovery name [puppet] - https://gerrit.wikimedia.org/r/914881
[22:26:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:27:20] <wikibugs>	 (PS2) Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - https://gerrit.wikimedia.org/r/911920
[22:27:22] <wikibugs>	 (PS1) Dzahn: gerrit: move all gerrit::profile hiera keys to common/profile [puppet] - https://gerrit.wikimedia.org/r/914901
[22:28:05] <wikibugs>	 (CR) Dzahn: gerrit: move hieradata from role/common to common/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911920 (owner: Dzahn)
[22:28:19] <wikibugs>	 (Abandoned) Dzahn: gerrit: move all gerrit::profile hiera keys to common/profile [puppet] - https://gerrit.wikimedia.org/r/914901 (owner: Dzahn)
[22:30:53] <wikibugs>	 (PS3) Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - https://gerrit.wikimedia.org/r/911920
[22:31:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (9) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:33:00] <wikibugs>	 (PS1) Zabe: Start writing to af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/914903 (https://phabricator.wikimedia.org/T334295)
[22:33:15] <zabe>	 jouncebot: nowandnext
[22:33:15] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 26 minute(s)
[22:33:15] <jouncebot>	 In 7 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600)
[22:33:15] <jouncebot>	 In 7 hour(s) and 26 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600)
[22:33:19] <icinga-wm>	 PROBLEM - Check systemd state on elastic2080 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:33:49] <wikibugs>	 (CR) Dzahn: gerrit: move hieradata from role/common to common/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/911920 (owner: Dzahn)
[22:33:55] <wikibugs>	 (CR) Zabe: [C: +2] Start writing to af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/914903 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[22:34:39] <tzatziki>	 !log removing 12 files for legal compliance
[22:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:40] <wikibugs>	 (Merged) jenkins-bot: Start writing to af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/914903 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[22:35:30] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:914903|Start writing to af_actor/afh_actor in group1 wikis (T334295)]]
[22:35:30] <wikibugs>	 (PS4) Dzahn: gerrit: move hieradata from role/common to common/profile [puppet] - https://gerrit.wikimedia.org/r/911920
[22:35:32] <stashbot>	 T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295
[22:35:33] <icinga-wm>	 RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:06] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:914903|Start writing to af_actor/afh_actor in group1 wikis (T334295)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:41:07] <icinga-wm>	 PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:11] <icinga-wm>	 RECOVERY - Check systemd state on elastic2080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:41:21] <icinga-wm>	 PROBLEM - Check systemd state on elastic2083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:21] <icinga-wm>	 PROBLEM - Check systemd state on elastic2081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:41] <icinga-wm>	 RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:43] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:914903|Start writing to af_actor/afh_actor in group1 wikis (T334295)]] (duration: 07m 13s)
[22:42:47] <stashbot>	 T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295
[22:42:55] <icinga-wm>	 RECOVERY - Check systemd state on elastic2083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:55] <icinga-wm>	 RECOVERY - Check systemd state on elastic2081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:46:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:50:03] <icinga-wm>	 PROBLEM - Check systemd state on elastic2086 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:51:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:51:39] <icinga-wm>	 RECOVERY - Check systemd state on elastic2086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:57:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[22:57:19] <icinga-wm>	 PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:43] <icinga-wm>	 PROBLEM - Check systemd state on elastic2073 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:05] <icinga-wm>	 PROBLEM - Check systemd state on elastic2061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:17] <icinga-wm>	 RECOVERY - Check systemd state on elastic2073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:39] <icinga-wm>	 RECOVERY - Check systemd state on elastic2061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:02] <tzatziki>	 !log removing 1 file for legal compliance
[23:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:47] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T335835
[23:16:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:21:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:30] <wikibugs>	 (PS1) EoghanGaffney: [spicerack/decorators] Don't miss dry_run if it's disabled in kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855)
[23:35:11] <wikibugs>	 (PS2) EoghanGaffney: [spicerack/decorators] Don't miss dry_run if it's disabled in kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855)
[23:47:22] <wikibugs>	 (PS1) Xcollazo: Add configs to spark-defaults.conf to enable Iceberg. [puppet] - https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721)