[00:05:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:05:58] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:07:23] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003
[00:08:26] <jinxer-wm>	 FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:35] <wikibugs>	 (PS1) Sbisson: Enable the Article Guidance experiment on simplewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278)
[00:16:41] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) (owner: Sbisson)
[00:19:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:37:49] <jinxer-wm>	 FIRING: DiskSpace: Disk space build2001:9100:/ 1.43% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:10:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1287044
[01:10:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1287044 (owner: TrainBranchBot)
[01:21:12] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1287044 (owner: TrainBranchBot)
[02:00:37] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:26] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 49s)
[02:09:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:31:30] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:34:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:23] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:46:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:08:41] <jinxer-wm>	 FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11920464 (Papaul)
[04:15:53] <wikibugs>	 (CR) WAN233: change logo at zh-classical wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[04:19:04] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:37:49] <jinxer-wm>	 FIRING: DiskSpace: Disk space build2001:9100:/ 1.429% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:56:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:28] <logmsgbot>	 !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 5:00:00 on 13 hosts with reason: Sanitarium s2 master: reimage to Debian Trixie
[05:04:54] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: Sanitarium s7 master: reimage to Debian Trixie
[05:05:26] <wikibugs>	 (PS1) Marostegui: db1158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1287080 (https://phabricator.wikimedia.org/T425388)
[05:05:45] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Reimage to Trixie
[05:05:50] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1158: Reimage to Trixie
[05:06:18] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1158: Reimage to Trixie
[05:10:02] <logmsgbot>	 marostegui@cumin1003 reimage (PID 3741973) is awaiting input
[05:12:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1158.eqiad.wmnet with OS trixie
[05:12:47] <wikibugs>	 (CR) Marostegui: [C:+2] db1158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1287080 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[05:25:42] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage
[05:29:29] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage
[05:38:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:38:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:38:43] <wikibugs>	 (PS1) Marostegui: Revert "db1158: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1287248
[05:40:14] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1158: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1287248 (owner: Marostegui)
[05:41:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:44:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:46:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:46:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:50:44] <icinga-wm>	 PROBLEM - SSH on an-worker1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:51:20] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1158.eqiad.wmnet with OS trixie
[05:51:34] <icinga-wm>	 RECOVERY - SSH on an-worker1200 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:54:00] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1158: after reimage to trixie
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0600).
[06:33:51] <kart_>	 Deploying cxserver..
[06:35:15] <wikibugs>	 (PS1) Abijeet Patro: ULS rewrite: Enable on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1287288
[06:39:25] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1158: after reimage to trixie
[06:39:33] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2013: Replacing HW T418973
[06:39:33] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:39:36] <stashbot>	 T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[06:39:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:39:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2013: Replacing HW T418973
[06:40:00] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:40:35] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:41:02] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2013,2023].codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance on pc3
[06:42:14] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize pc2023 [puppet] - https://gerrit.wikimedia.org/r/1287289 (https://phabricator.wikimedia.org/T418973)
[06:43:05] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize pc2023 [puppet] - https://gerrit.wikimedia.org/r/1287289 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0700).
[07:00:04] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:39] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:01:34] <kart_>	 !log Update cxserver to 2026-04-23-114216-production (T423002)
[07:01:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:37] <stashbot>	 T423002: Migrate cxserver in production to node24 - https://phabricator.wikimedia.org/T423002
[07:07:15] <wikibugs>	 (PS1) Ryan Kemper: hadoop.reboot-workers: drop custom --dry-run [cookbooks] - https://gerrit.wikimedia.org/r/1287290 (https://phabricator.wikimedia.org/T411568)
[07:07:47] <wikibugs>	 (PS8) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1.This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - [...]-centralauth-production.sh to sqoop the centralauth production tables. - [...]-mediawiki-clouddb.sh to sqoop the cloudb tables. - [...]-mediawiki-production.sh to sqoop production replicas tabl
[07:07:47] <wikibugs>	 https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[07:08:27] <wikibugs>	 (CR) Ryan Kemper: [C:+2] hadoop.reboot-workers: make host override smarter (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: Ryan Kemper)
[07:09:43] <wikibugs>	 (CR) CI reject: [V:-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1.This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - [...]-centralauth-production.sh to sqoop the centralauth production tables. - [...]-mediawiki-clouddb.sh to sqoop the cloudb tables. - [...]-mediawiki-production.sh to sqoop production rep
[07:09:43] <wikibugs>	 https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[07:13:48] <wikibugs>	 (PS2) Ryan Kemper: airflow-test-k8s: add ldap-sync task-pod egress [deployment-charts] - https://gerrit.wikimedia.org/r/1286750 (https://phabricator.wikimedia.org/T420691)
[07:21:53] <wikibugs>	 (PS9) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[07:23:49] <wikibugs>	 (CR) CI reject: [V:-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[07:25:31] <wikibugs>	 (PS3) Ryan Kemper: cirrussearch: install atop utility [puppet] - https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[07:25:42] <wikibugs>	 (CR) Ryan Kemper: "I added a guard so we won't have the PCC failure (would break puppet on cirrussearch afaict)" [puppet] - https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[07:26:17] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[07:29:15] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "LGTM now; pcc's happy" [puppet] - https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[07:29:45] <wikibugs>	 (PS10) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[07:31:42] <wikibugs>	 (CR) CI reject: [V:-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[07:34:35] <wikibugs>	 (PS11) A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[07:49:02] <wikibugs>	 (PS1) Elukey: docker_registry: allow multiple docker instances [puppet] - https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978)
[08:00:05] <jouncebot>	 andre and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0800).
[08:00:06] <andre>	 I will now start promoting group2 wikis to 1.47.0-wmf.2
[08:01:34] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.47.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1287355 (https://phabricator.wikimedia.org/T423911)
[08:01:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287355 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[08:02:43] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.47.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1287355 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[08:04:55] <wikibugs>	 (PS7) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512)
[08:06:30] <wikibugs>	 (CR) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[08:06:36] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[08:06:53] <effie>	 jouncebot: next
[08:06:53] <jouncebot>	 In 1 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1000)
[08:08:41] <jinxer-wm>	 FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:49] <logmsgbot>	 !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.2  refs T423911
[08:08:54] <stashbot>	 T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911
[08:10:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:10:06] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:12:35] <jinxer-wm>	 RESOLVED: DiskSpace: Disk space build2001:9100:/ 1.435% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:19:04] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:21:30] <jynus>	 the pull failed, gerrit issue?
[08:34:22] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] thanos/query: set thanos-query alert.query-url [puppet] - https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) (owner: Tiziano Fogli)
[08:37:11] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db2149 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1287356 (https://phabricator.wikimedia.org/T424341)
[08:38:08] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Remove db2149 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1287356 (https://phabricator.wikimedia.org/T424341) (owner: Marostegui)
[08:39:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2149 T424341', diff saved to https://phabricator.wikimedia.org/P92520 and previous config saved to /var/cache/conftool/dbconfig/20260514-083916-marostegui.json
[08:39:20] <stashbot>	 T424341: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341
[08:40:04] <wikibugs>	 (PS1) Marostegui: db2149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1287357 (https://phabricator.wikimedia.org/T424341)
[08:40:50] <wikibugs>	 (CR) Marostegui: [C:+2] db2149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1287357 (https://phabricator.wikimedia.org/T424341) (owner: Marostegui)
[08:49:26] <effie>	 andre: how far down are you?
[08:50:10] <andre>	 effie: Done, go ahead
[08:50:12] <andre>	 Seeing one spike but that's on a closed wiki and nothing to roll back for
[08:50:22] <effie>	 grand, thank you
[08:51:38] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1286875 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[08:53:17] <wikibugs>	 (CR) Atsuko: [C:+2] opensearch-ttmserver: switch to opensearch 2 [deployment-charts] - https://gerrit.wikimedia.org/r/1286957 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[08:53:43] <jynus>	 cumin2002: I think it is because there are local changes on the homer repo
[08:53:47] <wikibugs>	 (Merged) jenkins-bot: ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1286875 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[08:53:48] <jynus>	 I will tell the netops
[08:54:16] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hosts.reimage for host mc1065.eqiad.wmnet with OS bullseye
[08:54:18] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1066.eqiad.wmnet with OS bullseye
[08:54:20] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1067.eqiad.wmnet with OS bullseye
[08:54:27] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1068.eqiad.wmnet with OS bullseye
[08:55:13] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply
[08:55:33] <wikibugs>	 (Merged) jenkins-bot: opensearch-ttmserver: switch to opensearch 2 [deployment-charts] - https://gerrit.wikimedia.org/r/1286957 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[08:55:38] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[08:56:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:32] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage
[09:06:37] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage
[09:06:49] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage
[09:06:51] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage
[09:07:31] <wikibugs>	 (CR) Btullis: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata)
[09:10:38] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage
[09:11:24] <wikibugs>	 (CR) Btullis: [C:+1] archiva: block scraper UAs at nginx [puppet] - https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: Ryan Kemper)
[09:11:43] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1287361 (https://phabricator.wikimedia.org/T426291)
[09:11:56] <wikibugs>	 (PS1) Ilias Sarantopoulos: (WIP)ml-services: add qwen36-27b to experimental [deployment-charts] - https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680)
[09:14:28] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage
[09:16:25] <wikibugs>	 (PS1) Marco Fossati: Scale share-highlight card to fit small viewports [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247)
[09:16:41] <wikibugs>	 (PS2) Effie Mouzeli: api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976)
[09:17:02] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) (owner: Marco Fossati)
[09:17:16] <icinga-wm>	 PROBLEM - Memcached on mc1065 is CRITICAL: connect to address 10.64.177.8 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:17:16] <icinga-wm>	 PROBLEM - Memcached on mc1067 is CRITICAL: connect to address 10.64.183.11 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:18:39] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage
[09:19:04] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[09:20:55] <Emperor>	 !log rebalance codfw swift rings T354872
[09:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:58] <stashbot>	 T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872
[09:21:07] <wikibugs>	 (Merged) jenkins-bot: api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[09:21:52] <wikibugs>	 (PS1) CWilliams: icinga/cgi.cfg: Adding CWilliams to authorized users [puppet] - https://gerrit.wikimedia.org/r/1287364 (https://phabricator.wikimedia.org/T426292)
[09:23:20] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage
[09:24:53] <wikibugs>	 (CR) Marostegui: [C:+1] icinga/cgi.cfg: Adding CWilliams to authorized users [puppet] - https://gerrit.wikimedia.org/r/1287364 (https://phabricator.wikimedia.org/T426292) (owner: CWilliams)
[09:25:16] <icinga-wm>	 RECOVERY - Memcached on mc1065 is OK: TCP OK - 0.000 second response time on 10.64.177.8 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[09:25:36] <wikibugs>	 (CR) Zabe: "Do we want to try to implement some sort of "slow rollout" for this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[09:25:57] <wikibugs>	 (CR) Zabe: "Do we want to try to implement some sort of "slow rollout" for this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[09:26:00] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1066.eqiad.wmnet with OS bullseye
[09:26:40] <wikibugs>	 (CR) CWilliams: [C:+2] icinga/cgi.cfg: Adding CWilliams to authorized users [puppet] - https://gerrit.wikimedia.org/r/1287364 (https://phabricator.wikimedia.org/T426292) (owner: CWilliams)
[09:27:46] <wikibugs>	 (PS1) MVernon: swift: remove 2 drained nodes for reimage [puppet] - https://gerrit.wikimedia.org/r/1287365 (https://phabricator.wikimedia.org/T354872)
[09:29:58] <wikibugs>	 (CR) Marostegui: [C:+1] swift: remove 2 drained nodes for reimage [puppet] - https://gerrit.wikimedia.org/r/1287365 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[09:30:16] <icinga-wm>	 RECOVERY - Memcached on mc1067 is OK: TCP OK - 0.000 second response time on 10.64.183.11 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[09:30:27] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1068.eqiad.wmnet with OS bullseye
[09:33:49] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1065.eqiad.wmnet with OS bullseye
[09:39:22] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1067.eqiad.wmnet with OS bullseye
[09:41:51] <wikibugs>	 (PS1) JavierMonton: stream: mediawiki.page_html_content_change [mediawiki-config] - https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920)
[09:43:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ...
[09:43:51] <jinxer-wm>	 IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[09:44:08] <bjensen>	 !ack
[09:44:09] <sirenbot>	 7929 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw)
[09:44:57] <bjensen>	 that's the codfw -> eqsin link?
[09:46:33] <XioNoX>	 bjensen: yep
[09:46:42] <XioNoX>	 on my laptop in 5-10 min
[09:47:01] <XioNoX>	 but it's scraping in eqsin
[09:49:50] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2152.codfw.wmnet: Host will be decommissioned
[09:51:36] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[09:51:59] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[09:54:03] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920913 (10A_smart_kitten) FWIW, if the correct image to be displayed for https://commons.wikimedia.org/wiki/File:CitationHelper_-_VE_Editor_Toolbar.png is the one in @aklapper's screenshot, then it seems...
[09:54:26] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: remove 2 drained nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1287365 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[09:54:52] <logmsgbot>	 !log cwilliams@cumin1003 END (ERROR) - Cookbook sre.mysql.depool (exit_code=97) depool db2152.codfw.wmnet: Host will be decommissioned
[09:55:32] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[09:55:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[09:58:49] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:58:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[09:59:03] <wikibugs>	 (03PS1) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295)
[09:59:13] <wikibugs>	 (03PS1) 10Phuedx: ext.wikimediaEvents: Add synth-aa-ncs-1 experiment [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514)
[09:59:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) (owner: 10Phuedx)
[09:59:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1000)
[10:00:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[10:00:29] <wikibugs>	 (03PS2) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976)
[10:00:31] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920979 (10jcrespo) Hi, @Jcubic thanks for the report. On upload of a new version, caches are normally purged from our content delivery network, however how much time it takes for that to propagate depends...
[10:01:23] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920981 (10A_smart_kitten) I seem to be getting different responses for <https://upload.wikimedia.org/wikipedia/commons/3/36/CitationHelper_-_VE_Editor_Toolbar.png> from different datacenters. Potentially(...
[10:02:07] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2152: Host will be decommissioned
[10:02:26] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920984 (10jcrespo) >>! In T426293#11920981, @A_smart_kitten wrote: > I seem to be getting different responses for <https://upload.wikimedia.org/wikipedia/commons/3/36/CitationHelper_-_VE_Editor_Toolbar.pn...
[10:02:27] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2152: Host will be decommissioned
[10:02:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent
[10:03:09] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920988 (10A_smart_kitten) I think we were both typing comments here at the same time :D
[10:05:08] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:05:10] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:07:48] <wikibugs>	 (03PS3) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976)
[10:07:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent
[10:10:48] <wikibugs>	 (03PS1) 10CWilliams: instances.yaml: Decommissioning db2152.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287371 (https://phabricator.wikimedia.org/T424344)
[10:11:21] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[10:11:30] <bjensen>	 !ack
[10:11:31] <jinxer-wm>	 FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[10:11:31] <sirenbot>	 7930 (ACKED)  TransitPeeringTransportOutSaturation network sre (gnmi)
[10:14:02] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm
[10:14:09] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm
[10:14:35] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1069.eqiad.wmnet with OS bullseye
[10:15:21] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1063.eqiad.wmnet with OS bullseye
[10:15:37] <wikibugs>	 (03PS1) 10Cathal Mooney: wmf-netbox: add new bgp group mappings for dse-k8s-wdqs nodes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653)
[10:16:28] <wikibugs>	 (03PS2) 10Cathal Mooney: wmf-netbox: add new bgp group mappings for dse-k8s-wdqs nodes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653)
[10:16:31] <jinxer-wm>	 RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[10:17:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm
[10:18:30] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] "Similar to I64475fafdae90bc55ff3e8046dda48b85217594d" [puppet] - 10https://gerrit.wikimedia.org/r/1286775 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli)
[10:18:34] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:34] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:36] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:40] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:48] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:48] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:19:18] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply
[10:19:23] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply
[10:19:24] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:24] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:24] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[10:19:28] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:30] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653) (owner: 10Cathal Mooney)
[10:19:33] <jynus>	 what was that?
[10:19:36] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:36] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:36] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:36] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:36] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:36] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:36] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:38] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:38] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:20:01] <wikibugs>	 (03PS4) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976)
[10:20:49] <wikibugs>	 (03PS5) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976)
[10:21:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Create single Homer BGP group template to cover all variants - https://phabricator.wikimedia.org/T349116#11921022 (10cmooney) 05Open→03Declined Closing this one for now.  We do need to look at this, but also we need to review in light of having both Junip...
[10:21:21] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr4-ulsfo:xe-0/1/2 (Transport: cr2-eqsin:xe-0/1/4 (NTT, 369639) {#1076}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=ulsfo+prometheus%2Fops&var-device=cr4-ulsfo:9804&var-interface=xe-0%2F1%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[10:21:26] <jynus>	 the blip seemed real, although not a lot of impact
[10:21:34] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] instances.yaml: Decommissioning db2152.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287371 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams)
[10:21:35] <jynus>	 network would make sense as the culprit
[10:21:37] <hnowlan>	 I'm guessing that's related to the saturation
[10:21:43] <jynus>	 oh
[10:21:57] <wikibugs>	 (03CR) 10CWilliams: [C:03+2] instances.yaml: Decommissioning db2152.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287371 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams)
[10:22:06] <jynus>	 I didn't know that was ongoing
[10:25:27] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply
[10:25:32] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply
[10:26:18] <wikibugs>	 06SRE, 10observability: Setup BGP monitoring for PyBal, including amount of prefixes - https://phabricator.wikimedia.org/T79124#11921068 (10cmooney) 05Open→03Resolved a:03cmooney I'm gonna close this one.  We now have alerting on this via the bgp stats exported via gnmi (see [[ https://gerrit.wikimed...
[10:26:33] <bjensen>	 cortobot: list
[10:26:46] <bjensen>	 oh whoops :D
[10:27:09] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage
[10:27:26] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage
[10:28:57] <wikibugs>	 (03PS2) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295)
[10:29:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[10:31:44] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[10:33:20] <wikibugs>	 (03PS1) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087)
[10:33:29] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[10:33:44] <wikibugs>	 (03Merged) 10jenkins-bot: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[10:33:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#11921091 (10cmooney) 05Open→03Resolved a:03cmooney I'm going to close this one.   I think everyone is agreed cross-rack links to have an LVS peer...
[10:34:13] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage
[10:34:17] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[10:34:31] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[10:38:13] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage
[10:40:53] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[10:41:33] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[10:42:05] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[10:42:15] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[10:42:42] <federico3>	 looking
[10:43:07] <marostegui>	 cezmunsta: ^
[10:43:19] <federico3>	 it's db2152 being removed
[10:43:24] <marostegui>	 cezmunsta: once the puppet change is merged, you need to execute the dbctl command to remove it from dbctl in cumin
[10:43:52] <marostegui>	 cezmunsta: as mentioned here: https://wikitech.wikimedia.org/wiki/MariaDB/Decommissioning_a_DB_Host#Remove_the_host_from_dbctl
[10:44:19] <marostegui>	 cezmunsta: you can just run: sudo dbctl config commit -m "Remove HOSTNAME from dbctl TASKNUMBER" from cumin1003 for instance
[10:44:21] * cezmunsta Yep, but currently not resolving that host :)
[10:44:41] <marostegui>	 cezmunsta: do you want me to run it for you so it isn't left hanging there for long?
[10:44:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11921120 (10cmooney) 05Open→03Resolved Ok this is rolled out and working.  I have tried to update our dashboards wher...
[10:44:58] * cezmunsta marostegui: yes please
[10:45:05] <marostegui>	 cezmunsta: doing it!
[10:45:09] <cezmunsta>	 ty
[10:45:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2152 from dbctl T424344', diff saved to https://phabricator.wikimedia.org/P92523 and previous config saved to /var/cache/conftool/dbconfig/20260514-104521-marostegui.json
[10:45:25] <stashbot>	 T424344: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344
[10:45:27] <marostegui>	 done!
[10:47:05] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[10:47:15] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
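Note on the dbctl exchange above (10:42 to 10:47): the commit step marostegui describes can be sketched as below. This is only an illustration, assuming the message template he quotes ("Remove HOSTNAME from dbctl TASKNUMBER"); the helper name `dbctl_commit_command` is hypothetical and not part of dbctl or any Wikimedia tooling. It builds the command line and prints it rather than running anything.

```python
import shlex


def dbctl_commit_command(hostname: str, task: str) -> list[str]:
    """Build the argv for the commit step quoted in the log.

    Hypothetical helper: only the command shape and the message template
    ("Remove HOSTNAME from dbctl TASKNUMBER") come from the conversation.
    """
    if not hostname or not task:
        raise ValueError("hostname and task are both required")
    message = f"Remove {hostname} from dbctl {task}"
    return ["sudo", "dbctl", "config", "commit", "-m", message]


if __name__ == "__main__":
    # Values taken from the log: host db2152, task T424344.
    argv = dbctl_commit_command("db2152", "T424344")
    # Print a copy-pasteable shell line instead of executing it.
    print(shlex.join(argv))
```

With the values from this log, the generated message matches the one recorded in the 10:45:22 !log entry. Returning a list and quoting it with `shlex.join` keeps the message as a single argument even though it contains spaces.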
[10:49:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11921125 (10cmooney) {F81386939 width=600}
[10:49:29] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1069.eqiad.wmnet with OS bullseye
[10:50:21] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11921128 (10MatthewVernon)
[10:53:25] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1063.eqiad.wmnet with OS bullseye
[10:53:55] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply
[10:53:58] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply
[10:56:02] <wikibugs>	 (03PS1) 10Federico Ceratto: cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926)
[10:56:02] <wikibugs>	 (03CR) 10Federico Ceratto: "(as discussed on IRC with elukey)" [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto)
[10:56:06] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] wmf-netbox: add new bgp group mappings for dse-k8s-wdqs nodes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653) (owner: 10Cathal Mooney)
[10:56:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto)
[10:57:14] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi)
[10:58:59] <wikibugs>	 (03PS3) 10Ayounsi: Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839)
[11:00:11] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[11:00:26] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[11:00:53] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: sync
[11:01:04] <logmsgbot>	 !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: sync
[11:01:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi)
[11:02:04] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm
[11:05:52] <wikibugs>	 (03PS2) 10Federico Ceratto: cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926)
[11:08:56] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[11:08:59] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[11:13:20] <wikibugs>	 (03PS3) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295)
[11:14:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[11:16:50] <wikibugs>	 (03PS4) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295)
[11:17:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[11:19:19] <wikibugs>	 (03PS5) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295)
[11:19:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:19:41] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[11:19:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:19:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[11:20:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[11:20:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:20:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:20:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:21:02] <wikibugs>	 (03CR) 10Cathal Mooney: "recheck" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi)
[11:22:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:23:06] <wikibugs>	 (03PS6) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295)
[11:25:08] <wikibugs>	 (03Abandoned) 10Sergio Gimeno: loggedOutWarning: set lastEditor used earlier [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285743 (https://phabricator.wikimedia.org/T425604) (owner: 10Sergio Gimeno)
[11:26:02] <wikibugs>	 (03CR) 10Chlod Alejandro: [C:03+1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[11:26:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:26:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:31:12] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
[11:31:22] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
[11:48:13] <wikibugs>	 (03CR) 10Elukey: "The request is legit, I'll go through my team just to be sure and come back!" [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto)
[11:54:48] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add pc2023 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1287384 (https://phabricator.wikimedia.org/T418973)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1200)
[12:08:41] <jinxer-wm>	 FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:28] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi)
[12:10:33] <wikibugs>	 (03PS1) 10Atsuko: services_proxy: isetting up toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248)
[12:14:39] <wikibugs>	 (03PS3) 10Federico Ceratto: cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926)
[12:15:30] <wikibugs>	 (03CR) 10Federico Ceratto: cumin: Install pydantic and httpx packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto)
[12:15:36] <wikibugs>	 (03PS2) 10Effie Mouzeli: rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285344 (https://phabricator.wikimedia.org/T419976)
[12:16:53] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Add pc2023 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1287384 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui)
[12:17:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1287391 (owner: 10L10n-bot)
[12:18:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285344 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[12:18:21] <jynus>	 another blip
[12:18:28] <jynus>	 on codfw upload
[12:18:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2023 to pc3 T418973', diff saved to https://phabricator.wikimedia.org/P92524 and previous config saved to /var/cache/conftool/dbconfig/20260514-121839-marostegui.json
[12:18:41] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:41] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:43] <stashbot>	 T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[12:18:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:50] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:50] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:52] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:19:15] <jynus>	 larger this time
[12:19:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:31] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:40] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:40] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:41] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2023 to pc3 codfw master T418973', diff saved to https://phabricator.wikimedia.org/P92525 and previous config saved to /var/cache/conftool/dbconfig/20260514-121958-marostegui.json
[12:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285344 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli)
[12:20:53] <wikibugs>	 (03PS1) 10Marostegui: pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287395 (https://phabricator.wikimedia.org/T418973)
[12:21:15] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:21:32] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:21:35] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287395 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui)
[12:22:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426221#11921398 (10Jclark-ctr) Rebalanced PDU; continuing to monitor
[12:22:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1287396 (owner: 10L10n-bot)
[12:24:43] <wikibugs>	 (03PS1) 10Marostegui: pc2023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287397 (https://phabricator.wikimedia.org/T418973)
[12:26:01] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] pc2023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287397 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui)
[12:27:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc3 with pc2023 as codfw master T418973', diff saved to https://phabricator.wikimedia.org/P92526 and previous config saved to /var/cache/conftool/dbconfig/20260514-122707-marostegui.json
[12:27:12] <stashbot>	 T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[12:27:36] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 28458
[12:28:48] <wikibugs>	 (03PS1) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087)
[12:31:13] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 28458
[12:33:13] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format pc2023 [puppet] - 10https://gerrit.wikimedia.org/r/1287408 (https://phabricator.wikimedia.org/T418973)
[12:37:15] <wikibugs>	 (03CR) 10Nikerabbit: [C:04-1] "No code change?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (owner: 10Abijeet Patro)
[12:39:02] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174)
[12:40:31] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: update bgp groups for dse-k8s-wdqs - cmooney@cumin1003
[12:42:10] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: update bgp groups for dse-k8s-wdqs - cmooney@cumin1003
[12:42:36] <wikibugs>	 (03CR) 10Sbisson: [C:03+1] Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) (owner: 10KartikMistry)
[12:42:56] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[12:43:06] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) (owner: 10KartikMistry)
[12:43:14] <kart_>	 Deploying cxserver.
[12:45:11] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) (owner: 10KartikMistry)
[12:45:51] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi)
[12:46:40] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:47:03] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:47:23] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1279] - vriley@cumin1003"
[12:47:28] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1279] - vriley@cumin1003"
[12:47:28] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:47:55] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1279
[12:49:28] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[12:49:38] <wikibugs>	 (03PS1) 10Btullis: mediawiki-dumps-legacy: Allow launching dumps from airflow-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179)
[12:49:43] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T426291
[12:49:44] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1279
[12:49:46] <stashbot>	 T426291: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T426291
[12:50:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2161 with weight 0 T426291', diff saved to https://phabricator.wikimedia.org/P92527 and previous config saved to /var/cache/conftool/dbconfig/20260514-125014-fceratto.json
[12:50:34] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1279.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:53:15] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1280] - vriley@cumin1003"
[12:53:21] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1280] - vriley@cumin1003"
[12:53:21] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:53:39] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1280
[12:54:42] <wikibugs>	 (03CR) 10Lerickson: [C:03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis)
[12:54:55] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1280
[12:54:55] <wikibugs>	 (03PS2) 10Abijeet Patro: ULS rewrite: Enable on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288)
[12:55:17] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[12:55:48] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:56:14] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1280.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:56:23] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:56:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:40] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:58:01] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921466 (10VRiley-WMF) 05Open→03In progress
[12:58:09] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1287361 (https://phabricator.wikimedia.org/T426291) (owner: 10Gerrit maintenance bot)
[12:58:15] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:59:05] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1281] - vriley@cumin1003"
[12:59:11] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1281] - vriley@cumin1003"
[12:59:11] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:59:21] <mfossati>	 o/
[12:59:32] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1281
[13:00:05] <jouncebot>	 Urbanecm and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1300).
[13:00:05] <jouncebot>	 annet, Nvdtn19, Krinkle, stephanebisson, mfossati, phuedx, and robertsky: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:17] <annet>	 o/
[13:00:20] <stephanebisson>	 o/
[13:00:21] <wikibugs>	 (03Merged) 10jenkins-bot: GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi)
[13:00:32] <Krinkle>	 o/
[13:00:38] <kart_>	 !log Updated cxserver to 2026-05-14-123010-production (T426174, T404298)
[13:00:41] <federico3>	 !log Starting s8 codfw failover from db2165 to db2161 - T426291
[13:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:43] <stashbot>	 T426174: cxserver unit tests leak mediawiki api requests - https://phabricator.wikimedia.org/T426174
[13:00:43] <stashbot>	 T404298: Can't translate en:Tokyo in Gujarati - https://phabricator.wikimedia.org/T404298
[13:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:46] <stashbot>	 T426291: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T426291
[13:01:33] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1281
[13:01:44] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[13:01:49] <Krinkle>	 annet: stephanebisson: Do either of you want to self service, if a deployer isn't available?
[13:02:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2161 to s8 primary T426291', diff saved to https://phabricator.wikimedia.org/P92528 and previous config saved to /var/cache/conftool/dbconfig/20260514-130213-fceratto.json
[13:02:17] <annet>	 Krinkle: I haven't done that before so would prefer not to
[13:02:22] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:03:26] <jinxer-wm>	 FIRING: [44x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:46] <stephanebisson>	 I can deploy my patch
[13:03:53] <stephanebisson>	 When it's my turn
[13:04:49] <Krinkle>	 stephanebisson: OK, I'd say go ahead. I can do annet's and mine after that.
[13:05:06] <annet>	 thanks!
[13:05:19] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:05:57] <stephanebisson>	 on it
[13:06:14] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1282] - vriley@cumin1003"
[13:06:19] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1282] - vriley@cumin1003"
[13:06:19] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:07:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson)
[13:07:19] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1282
[13:07:32] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[13:07:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Set correct weight T426291', diff saved to https://phabricator.wikimedia.org/P92529 and previous config saved to /var/cache/conftool/dbconfig/20260514-130743-fceratto.json
[13:07:47] <stashbot>	 T426291: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T426291
[13:08:00] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the Article Guidance experiment on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson)
[13:08:18] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2165: Repooling after switchover
[13:08:24] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1279.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:08:42] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287043|Enable the Article Guidance experiment on simplewiki (T426278)]]
[13:08:44] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1282
[13:08:45] <stashbot>	 T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278
[13:09:02] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:09:14] <wikibugs>	 (03CR) 10Ottomata: "Hm, what topic does this send to?  I suppose a generic rsylog topic?  Or is it a separate nginxdumps only topic?" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[13:09:52] <wikibugs>	 (03PS1) 10CWilliams: mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414
[13:09:53] <wikibugs>	 (03CR) 10Ottomata: "Oo, and does this send message to a topic combined with other rsyslog ECS formatted messages? If so, we have some thinking to do!" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[13:10:13] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:10:20] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2165: Repooling after switchover
[13:10:28] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[13:10:32] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1287043|Enable the Article Guidance experiment on simplewiki (T426278)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:10:42] <wikibugs>	 (03PS2) 10CWilliams: mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 (https://phabricator.wikimedia.org/T424344)
[13:11:30] <chlod>	 o/ robertsky doesn't seem to be available so i can supervise his patch instead
[13:12:04] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1282.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:12:08] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:12:41] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Continuing with deployment
[13:13:17] <robertsky>	 hihi
[13:13:19] <robertsky>	 o/
[13:13:20] <chlod>	 oh there he is
[13:13:24] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[13:13:59] <wikibugs>	 (03CR) 10Ottomata: Configure nginx to log requests in ECS format to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[13:14:23] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1280.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:14:41] <robertsky>	 hi Krinkle, can you help deploy my patch? apologies, was caught in the traffic back to the hotel.
[13:14:50] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams)
[13:15:18] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1283] - vriley@cumin1003"
[13:15:23] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1283] - vriley@cumin1003"
[13:15:23] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:15:55] <wikibugs>	 (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415
[13:16:15] <Krinkle>	 robertsky: np, once it's our turn I will look at yours. there are a few other patches before yours.
[13:16:32] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[13:16:52] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287043|Enable the Article Guidance experiment on simplewiki (T426278)]] (duration: 08m 10s)
[13:16:55] <stashbot>	 T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278
[13:16:57] <robertsky>	 thanks!
[13:17:32] <Krinkle>	 annet: would you like to deploy the backport and config simultaneously or one after the other?
[13:17:41] <stephanebisson>	 My deployment is done
[13:17:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format pc2023 [puppet] - 10https://gerrit.wikimedia.org/r/1287408 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui)
[13:17:57] <annet>	 Krinkle: simultaneous would be great
[13:18:00] <Krinkle>	 Okay
[13:18:05] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1283
[13:18:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich)
[13:18:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286327 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich)
[13:18:29] <wikibugs>	 (03PS1) 10Tiziano Fogli: slothslos/deploy: wrap cleanup command in Bash to allow brace expansion [puppet] - 10https://gerrit.wikimedia.org/r/1287416 (https://phabricator.wikimedia.org/T414579)
[13:18:39] <Krinkle>	 wmf.2 is everywhere according to https://versions.toolforge.org/
[13:18:40] <wikibugs>	 (03CR) 10Atsuko: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[13:18:41] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:45] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.97 ms
[13:18:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:54] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:55] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:18:56] <Krinkle>	 stand by for testing...
[13:19:07] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:19:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add ReadingLists Account Creation CTA campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich)
[13:19:28] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1283
[13:19:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:44] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:44] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:19:45] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:20:37] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1283.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:21:01] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] slothslos/deploy: wrap cleanup command in Bash to allow brace expansion [puppet] - 10https://gerrit.wikimedia.org/r/1287416 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli)
[13:21:39] <Krinkle>	 https://integration.wikimedia.org/zuul/#q=wmf
[13:21:49] <wikibugs>	 (03PS12) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335
[13:21:49] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:22:45] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.97 ms
[13:22:46] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:23:38] <wikibugs>	 (03PS13) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335
[13:24:57] <wikibugs>	 (03CR) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (owner: 10A-pizzata)
[13:25:11] <wikibugs>	 (03PS14) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335
[13:25:29] <wikibugs>	 (03PS15) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355)
[13:25:38] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680)
[13:29:10] <wikibugs>	 (03Merged) 10jenkins-bot: WelcomeSurvey: Respect returnTo for campaigns skipping the survey [extensions/GrowthExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286327 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich)
[13:29:26] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1285913|Add ReadingLists Account Creation CTA campaign (T422169)]], [[gerrit:1286327|WelcomeSurvey: Respect returnTo for campaigns skipping the survey (T422169)]]
[13:29:29] <stashbot>	 T422169: Account Creation CTA experiment: handle experience after account creation - https://phabricator.wikimedia.org/T422169
[13:29:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney)
[13:29:50] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1279.eqiad.wmnet with OS bookworm
[13:29:59] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1279.eqiad.wmnet with OS bookworm
[13:30:34] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1282.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:31:13] <logmsgbot>	 !log krinkle@deploy1003 krinkle, annet: Backport for [[gerrit:1285913|Add ReadingLists Account Creation CTA campaign (T422169)]], [[gerrit:1286327|WelcomeSurvey: Respect returnTo for campaigns skipping the survey (T422169)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:31:23] <annet>	 testing...
[13:31:50] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2150: Host will be decommissioned
[13:32:10] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2150: Host will be decommissioned
[13:32:25] <annet>	 Krinkle: LGTM, thanks again
[13:33:06] <logmsgbot>	 !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2151: Host will be decommissioned
[13:33:24] <wikibugs>	 (03CR) 10Cathal Mooney: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1225579 (owner: 10Ayounsi)
[13:33:26] <logmsgbot>	 !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2151: Host will be decommissioned
[13:34:45] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] services_proxy: isetting up toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[13:36:04] <wikibugs>	 (03PS1) 10Bking: dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293)
[13:36:27] <mfossati>	 Krinkle: I can self-deploy
[13:37:45] <Krinkle>	 annet: ok, proceeding
[13:37:47] <logmsgbot>	 !log krinkle@deploy1003 krinkle, annet: Continuing with deployment
[13:38:12] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1283.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:38:52] <Krinkle>	 mfossati: ack, there's one more at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1300 (my own). Then I can either do the one for robertsky, or we can swap to keep the order.
[13:39:26] <Krinkle>	 skipping Nvdtn19 and phuedx who haven't ack'ed yet to my knowledge.
[13:40:02] <wikibugs>	 (03CR) 10Krinkle: [C:03+2] Enable wgTrackMediaRequestProvenance on remaining Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle)
[13:40:17] <wikibugs>	 (03PS1) 10CWilliams: instances.yaml: Decommissioning db2151.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287420 (https://phabricator.wikimedia.org/T424343)
[13:40:28] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:40:57] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgTrackMediaRequestProvenance on remaining Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle)
[13:41:11] <wikibugs>	 (03CR) 10Atsuko: [C:03+1] dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[13:41:50] <mfossati>	 Krinkle: all right, can we please keep the order?
[13:41:59] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:41:59] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285913|Add ReadingLists Account Creation CTA campaign (T422169)]], [[gerrit:1286327|WelcomeSurvey: Respect returnTo for campaigns skipping the survey (T422169)]] (duration: 12m 33s)
[13:42:03] <stashbot>	 T422169: Account Creation CTA experiment: handle experience after account creation - https://phabricator.wikimedia.org/T422169
[13:42:34] <Krinkle>	 mfossati: sure, np.
[13:42:53] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]]
[13:42:55] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1280.eqiad.wmnet with OS bookworm
[13:42:57] <stashbot>	 T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338
[13:43:06] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921648 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1280.eqiad.wmnet with OS bookworm
[13:43:45] <wikibugs>	 (03PS1) 10CWilliams: instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342)
[13:44:07] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[13:44:15] <wikibugs>	 (03PS1) 10Bearloga: EventStreamConfig: fix product_metrics.web_base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209)
[13:44:41] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:45:11] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:45:23] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage
[13:45:49] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with deployment
[13:46:03] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.98 ms
[13:46:50] <wikibugs>	 (03PS1) 10Hnowlan: corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137)
[13:47:36] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: add sretest2010 to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1287425
[13:47:55] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342) (owner: 10CWilliams)
[13:48:18] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] instances.yaml: Decommissioning db2151.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287420 (https://phabricator.wikimedia.org/T424343) (owner: 10CWilliams)
[13:48:31] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:49:14] <phuedx>	 Krinkle: My bad. I lost track of time. I'm here if there's still space
[13:49:19] <Krinkle>	 I recall our deployment calendar previously stating a maximum number of patches per window. It seems this is no longer there. However, per T225730 CI for anything other than a pure config patch is up to 20min these days. Anyway, it is what it is.
[13:49:20] <stashbot>	 T225730: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730
[13:49:52] <Krinkle>	 phuedx: ack, in a minute, mfossati is up. After that I'm doing robertsky and can try to do yours as well, but it'll be after the hour is up.
[13:49:53] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage
[13:49:57] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] (duration: 07m 03s)
[13:50:00] <stashbot>	 T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338
[13:50:06] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:50:19] <Krinkle>	 mfossati: you're up :)
[13:50:23] <A_smart_kitten>	 Krinkle: https://wikitech.wikimedia.org/wiki/Backport_windows#Guidelines has the line "Our windows have a soft limit of 6 patches", that might be what you're remembering maybe? 
[13:50:45] <Krinkle>	 I see. It used to be on the calendar itself e.g. in the heading or caption at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1300
[13:50:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) (owner: 10Marco Fossati)
[13:50:48] <taavi>	 no, that used to be right on the calendar
[13:50:59] <A_smart_kitten>	 ah, ack
[13:51:04] <Krinkle>	 A_smart_kitten: anyway, thanks for finding that. Good to know
[13:51:26] <taavi>	 it was removed in a task I can't immediately find after someone (me iirc) pointed out that no-one followed that, on the argument that a patch limit is not the good thing to measure because a single sync can do multiple patches
[13:51:53] <Krinkle>	 fair enough. and if it's all config patches, one could do them in 5-10min each
[13:51:53] <wikibugs>	 (03PS2) 10Tiziano Fogli: thanos/compact: avoid constant Puppet changes [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911)
[13:52:06] <Krinkle>	 realistically 6 people is perhaps a better limit than number of patches
[13:52:10] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[13:52:11] <A_smart_kitten>	 https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/43
[13:52:30] <Krinkle>	 we have 7 this time, and given multiple involve a MW patch, that'll take 1.5h in total
[13:52:42] <logmsgbot>	 !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2150.codfw.wmnet with reason: Depooled host, will be decommissioned
[13:53:04] <logmsgbot>	 !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Depooled host, will be decommissioned
[13:53:07] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[13:53:10] <Krinkle>	 (7ppl, 8 patches)
[13:53:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2165 (T419635)', diff saved to https://phabricator.wikimedia.org/P92533 and previous config saved to /var/cache/conftool/dbconfig/20260514-135315-fceratto.json
[13:53:20] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:53:26] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:53:28] <logmsgbot>	 !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2152.codfw.wmnet with reason: Depooled host, will be decommissioned
[13:53:28] <wikibugs>	 (03Merged) 10jenkins-bot: Scale share-highlight card to fit small viewports [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) (owner: 10Marco Fossati)
[13:53:44] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[13:53:46] <logmsgbot>	 !log mfossati@deploy1003 Started scap sync-world: Backport for [[gerrit:1287363|Scale share-highlight card to fit small viewports (T426247)]]
[13:53:50] <stashbot>	 T426247: Share Highlight: Ensure dialog header is visible on small devices - https://phabricator.wikimedia.org/T426247
[13:54:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921702 (10Jhancock.wm)
[13:54:16] <Krinkle>	 now that the addition is tool-assisted, one could compute an estimate, e.g. per person, assign a 5-min or 20min estimate (if it includes a MW patch), and once it is >= 60min, don't advertise the window anymore as available.
[13:54:25] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:54:37] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[13:54:50] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.95 ms
[13:55:10] <mfossati>	 Krinkle: +1 to that
[13:55:14] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 560.13 seconds Jcrespo expected - The acknowledgement expires at: 2026-05-15 18:54:58. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:55:34] <logmsgbot>	 !log mfossati@deploy1003 mfossati: Backport for [[gerrit:1287363|Scale share-highlight card to fit small viewports (T426247)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:55:41] <James_F>	 Krinkle: And a 45 min if an MW patch that touches i18n?
[13:56:01] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:56:03] <mfossati>	 testing, please hold on
[13:56:04] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[13:56:19] <taavi>	 5 mins for a config patch seems a bit tight
[13:56:22] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:56:23] <taavi>	 but otherwise +1 to the idea
[13:56:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T419635)', diff saved to https://phabricator.wikimedia.org/P92534 and previous config saved to /var/cache/conftool/dbconfig/20260514-135626-fceratto.json
[13:56:42] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:56:44] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hosts.provision: add sretest2010 to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1287425 (owner: 10Elukey)
[13:56:46] <James_F>	 Used to be 45 seconds for config patches.
[13:56:49] <logmsgbot>	 !log mfossati@deploy1003 mfossati: Continuing with deployment
[13:56:50] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[13:56:54] * James_F shakes his cane at the passing of the times.
[13:57:04] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:57:29] <Krinkle>	 taavi: ack. not saying it'll be enforced at runtime, just about whether or not to prevent scheduling. I'd rather the tool allow too many than too few and become circumvented/ignored.
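The scheduling estimate Krinkle proposes at 13:54:16 could be sketched roughly as below. This is a hypothetical illustration, not the code of the actual scheduling tool: the `Deployer` type and the function names are assumptions made for the example. Only the 5-minute and 20-minute per-person figures and the 60-minute cutoff come from the discussion; the 45-minute i18n case raised by James_F and taavi's remark that 5 minutes is tight for a config patch are left out.

```python
from dataclasses import dataclass
from typing import Iterable

# Figures taken from the chat discussion; everything else is hypothetical.
CONFIG_ONLY_MINUTES = 5   # per person, config-only patches
MW_PATCH_MINUTES = 20     # per person, if any of their patches is a MW patch
WINDOW_MINUTES = 60       # stop advertising the window at or above this


@dataclass
class Deployer:
    """One person signed up for a backport window (hypothetical shape)."""
    name: str
    has_mw_patch: bool


def estimate_minutes(deployers: Iterable[Deployer]) -> int:
    """Sum the per-person estimates for everyone already in the window."""
    return sum(
        MW_PATCH_MINUTES if d.has_mw_patch else CONFIG_ONLY_MINUTES
        for d in deployers
    )


def window_is_advertised(deployers: Iterable[Deployer]) -> bool:
    """Advertise the window as available only while the estimate is under 60 min."""
    return estimate_minutes(deployers) < WINDOW_MINUTES
```

As noted at 13:57:29, a check like this would only decide whether a window is still offered for scheduling; nothing here enforces a limit while a deployment is running.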
[13:58:04] <Krinkle>	 James_F: hehe, hopefully not for much longer once we get the new l10n format live.
[13:58:14] <James_F>	 Krinkle: We'll see.
[13:58:23] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1280.eqiad.wmnet with reason: host reimage
[13:58:35] <wikibugs>	 (03CR) 10CWilliams: [C:03+2] instances.yaml: Decommissioning db2151.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287420 (https://phabricator.wikimedia.org/T424343) (owner: 10CWilliams)
[13:59:24] <robertsky>	 really sorry for the last minute patch. we got the ip addresses only today. >.< and the conference is tomorrow.
[13:59:46] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1282.eqiad.wmnet with OS bookworm
[13:59:58] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921768 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1282.eqiad.wmnet with OS bookworm
[14:00:04] <wikibugs>	 (03PS1) 10Btullis: Add support for creating arbitrary PVCs to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179)
[14:00:25] <wikibugs>	 (03PS1) 10Sbisson: Simplewiki: include article wizard in AG experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278)
[14:00:56] <logmsgbot>	 !log mfossati@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287363|Scale share-highlight card to fit small viewports (T426247)]] (duration: 07m 09s)
[14:00:59] <stashbot>	 T426247: Share Highlight: Ensure dialog header is visible on small devices - https://phabricator.wikimedia.org/T426247
[14:01:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson)
[14:01:05] <mfossati>	 done!
[14:01:11] <logmsgbot>	 !log cwilliams@cumin1003 dbctl commit (dc=all): 'Remove db2151 from dbctl T424343', diff saved to https://phabricator.wikimedia.org/P92535 and previous config saved to /var/cache/conftool/dbconfig/20260514-140110-cwilliams.json
[14:01:14] <stashbot>	 T424343: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343
[14:01:16] <wikibugs>	 (03CR) 10Tiziano Fogli: thanos/compact: avoid constant Puppet changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli)
[14:02:30] <wikibugs>	 (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415
[14:03:24] <Krinkle>	 robertsky: reviewing yours now
[14:03:38] <logmsgbot>	 elukey@cumin1003 reimage (PID 3911913) is awaiting input
[14:03:42] <wikibugs>	 06SRE, 10corto, 10Incident Tooling, 13Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11921785 (10A_smart_kitten) (cross-referencing to {T389664}, where the default visibility of incident tasks was previously discussed FWICS)
[14:04:05] <wikibugs>	 (03PS1) 10Atsuko: services_proxy: enabling toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248)
[14:04:35] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1280.eqiad.wmnet with reason: host reimage
[14:04:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[14:05:15] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie
[14:05:48] <wikibugs>	 (03Merged) 10jenkins-bot: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky)
[14:06:06] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1287367|throttle rule for ESEAP Conference 2026 15-18 May 2026 (T426295)]]
[14:06:10] <stashbot>	 T426295: Request throttle exemption of IP addresses for ESEAP Conference 2026 - https://phabricator.wikimedia.org/T426295
[14:06:24] <robertsky>	 Krinkle, will need to run maintenance script: https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold#Reset
[14:06:36] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92536 and previous config saved to /var/cache/conftool/dbconfig/20260514-140635-fceratto.json
[14:06:57] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:07:06] <robertsky>	 to clear the cache.
[14:07:15] <wikibugs>	 (03CR) 10Btullis: "Good question. I checked the output from this command:" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[14:07:32] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:07:33] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1279.eqiad.wmnet with OS bookworm
[14:07:41] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1279.eqiad.wmnet with OS bookworm completed: - db1279 (**PASS**)   -...
[14:07:58] <logmsgbot>	 !log krinkle@deploy1003 krinkle, robertsky: Backport for [[gerrit:1287367|throttle rule for ESEAP Conference 2026 15-18 May 2026 (T426295)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:08:22] <robertsky>	 nothing to test...
[14:08:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921799 (10Jhancock.wm)
[14:08:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921800 (10Jhancock.wm)
[14:08:42] <wikibugs>	 (03CR) 10Novem Linguae: "Does CortoBot need to be added to WMF-NDA before this patch is merged? This might avoid a similar issue to what happened in T389664#106960" [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan)
[14:09:06] <robertsky>	 test servers accessible
[14:09:40] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[14:09:43] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[14:09:57] <logmsgbot>	 !log krinkle@deploy1003 krinkle, robertsky: Continuing with deployment
[14:10:02] <Krinkle>	 ack
[14:10:53] <Krinkle>	 phuedx: are you okay self-servicing?
[14:11:54] <wikibugs>	 (03PS2) 10CWilliams: instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342)
[14:12:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921818 (10Jhancock.wm)
[14:12:35] <wikibugs>	 (03CR) 10Btullis: "Yes, confirmed:" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[14:13:04] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[14:14:07] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287367|throttle rule for ESEAP Conference 2026 15-18 May 2026 (T426295)]] (duration: 08m 00s)
[14:14:10] <stashbot>	 T426295: Request throttle exemption of IP addresses for ESEAP Conference 2026 - https://phabricator.wikimedia.org/T426295
[14:14:56] <ottomata>	 Hi, I have a no-op config change to deploy, let me know when the window is clear!  Krinkle it looks like you are waiting for phuedx?
[14:15:06] <phuedx>	 Krinkle: Can do
[14:15:25] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1282.eqiad.wmnet with reason: host reimage
[14:16:35] <wikibugs>	 (03CR) 10CWilliams: [C:03+2] instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342) (owner: 10CWilliams)
[14:16:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92537 and previous config saved to /var/cache/conftool/dbconfig/20260514-141644-fceratto.json
[14:17:04] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1284] - vriley@cumin1003"
[14:17:10] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1284] - vriley@cumin1003"
[14:17:10] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:17:25] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1284
[14:18:13] <logmsgbot>	 !log cwilliams@cumin1003 dbctl commit (dc=all): 'Remove db2150 from dbctl T424342', diff saved to https://phabricator.wikimedia.org/P92538 and previous config saved to /var/cache/conftool/dbconfig/20260514-141812-cwilliams.json
[14:18:17] <stashbot>	 T424342: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342
[14:18:18] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1283.eqiad.wmnet with OS bookworm
[14:18:24] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1283.eqiad.wmnet with OS bookworm
[14:18:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney)
[14:18:51] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1282.eqiad.wmnet with reason: host reimage
[14:18:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:55] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:18:56] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:19:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) (owner: 10Phuedx)
[14:19:15] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[14:19:23] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2165 (T419961)', diff saved to https://phabricator.wikimedia.org/P92539 and previous config saved to /var/cache/conftool/dbconfig/20260514-141922-fceratto.json
[14:19:36] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1284
[14:19:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:44] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:44] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:46] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:20:12] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1284.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:20:15] <wikibugs>	 (03CR) 10Hnowlan: "corto is already a member of WMF-NDA: https://phabricator.wikimedia.org/project/members/61/" [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan)
[14:21:33] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:22:21] <wikibugs>	 (03CR) 10Novem Linguae: "Looks like it's the 6th project and I only looked at the first 5. Naturally :) Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan)
[14:22:28] <wikibugs>	 (03Merged) 10jenkins-bot: ext.wikimediaEvents: Add synth-aa-ncs-1 experiment [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) (owner: 10Phuedx)
[14:22:32] <robertsky>	 Krinkle, can you run the following commands to clear the cache for the throttle? iirc, it takes 3 days for the cache to expire, if any? https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold#Reset
[14:22:44] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1287368|ext.wikimediaEvents: Add synth-aa-ncs-1 experiment (T419514)]]
[14:22:48] <stashbot>	 T419514: Run a synthetic A/A non-cache-splitting experiment - https://phabricator.wikimedia.org/T419514
[14:23:02] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:23:03] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1280.eqiad.wmnet with OS bookworm
[14:23:09] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1280.eqiad.wmnet with OS bookworm completed: - db1280 (**PASS**)   -...
[14:24:23] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[14:24:31] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1287368|ext.wikimediaEvents: Add synth-aa-ncs-1 experiment (T419514)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:24:45] <Krinkle>	 robertsky: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/b474e435e552b95feffaacf79d8553f0d1766545/wmf-config/InitialiseSettings.php#2406 I don't see any limit there longer than 24h
[14:25:13] <Krinkle>	 in any event, afaik the limit itself is not part of the cache, only the current count
[14:25:24] <robertsky>	 ok
[14:25:26] <Krinkle>	 clearing the cache is about resetting it if you need more than the currently configured limit without increasing it
[14:25:36] <Krinkle>	 i.e. instead of a patch like the one you made.
[14:25:49] <robertsky>	 ok
[14:25:59] <robertsky>	 thanks!
[14:26:10] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[14:26:21] <Krinkle>	 yw
[14:26:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T419961)', diff saved to https://phabricator.wikimedia.org/P92540 and previous config saved to /var/cache/conftool/dbconfig/20260514-142650-fceratto.json
[14:27:06] <wikibugs>	 (03PS4) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680)
[14:27:29] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[14:29:30] <phuedx>	 JavaScript console looks clear on regular navigation. No ResourceLoader errors etc. Continuing
[14:29:46] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with deployment
[14:30:04] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1430)
[14:31:46] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1285] - vriley@cumin1003"
[14:31:52] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1285] - vriley@cumin1003"
[14:31:52] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:32:16] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1285
[14:33:05] <wikibugs>	 (03CR) 10Dragoniez: "@fd7ezs8cx@mozmail.com When the backport window opens you must be available in IRC's #wikimedia-operations channel (see https://wikitech.w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19)
[14:33:47] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1285
[14:33:48] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1283.eqiad.wmnet with reason: host reimage
[14:33:56] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[14:33:58] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287368|ext.wikimediaEvents: Add synth-aa-ncs-1 experiment (T419514)]] (duration: 11m 14s)
[14:34:02] <stashbot>	 T419514: Run a synthetic A/A non-cache-splitting experiment - https://phabricator.wikimedia.org/T419514
[14:35:20] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:35:23] <wikibugs>	 (03CR) 10Federico Ceratto: "Yes the test ran fine." [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto)
[14:35:37] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:35:38] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1282.eqiad.wmnet with OS bookworm
[14:35:43] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921951 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1282.eqiad.wmnet with OS bookworm completed: - db1282 (**PASS**)   -...
[14:36:33] <phuedx>	 ottomata: Done!
[14:36:42] <phuedx>	  /cc Krinkle 
[14:37:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92541 and previous config saved to /var/cache/conftool/dbconfig/20260514-143659-fceratto.json
[14:37:02] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1289
[14:37:03] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1289
[14:37:38] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: add qwen36-27b to experimental ns (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos)
[14:37:41] <logmsgbot>	 vriley@cumin1003 provision (PID 3939945) is awaiting input
[14:38:40] <wikibugs>	 (03PS5) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680)
[14:38:42] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1287] - vriley@cumin1003"
[14:38:48] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1287] - vriley@cumin1003"
[14:38:48] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1284.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:39:03] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1283.eqiad.wmnet with reason: host reimage
[14:39:44] <wikibugs>	 (03PS6) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680)
[14:40:23] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos)
[14:41:27] <wikibugs>	 (03CR) 10Dragoniez: "See also https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_for_backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19)
[14:42:46] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] EventStreamConfig: fix product_metrics.web_base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) (owner: 10Bearloga)
[14:42:53] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: FY2526 Q3 ulsfo: switch refresh - https://phabricator.wikimedia.org/T408510#11921972 (10RobH)
[14:44:15] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1285.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:45:48] <wikibugs>	 (03PS3) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415
[14:46:51] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:47:07] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92542 and previous config saved to /var/cache/conftool/dbconfig/20260514-144707-fceratto.json
[14:47:18] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:47:20] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:48:14] <wikibugs>	 (03PS1) 10Codename Noreste: Restrict the changetags user right to bots and sysops on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445)
[14:49:45] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[14:49:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:51:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: 10Codename Noreste)
[14:52:23] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1284.eqiad.wmnet with OS bookworm
[14:52:29] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1284.eqiad.wmnet with OS bookworm
[14:53:44] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1288] - vriley@cumin1003"
[14:53:50] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1288] - vriley@cumin1003"
[14:53:50] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:53:53] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:54:06] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1288
[14:54:08] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage
[14:54:30] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[14:54:31] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1283.eqiad.wmnet with OS bookworm
[14:54:36] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1283.eqiad.wmnet with OS bookworm completed: - db1283 (**PASS**)   -...
[14:54:51] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:55:06] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1285.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:55:18] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1288
[14:56:19] <logmsgbot>	 vriley@cumin1003 provision (PID 3948521) is awaiting input
[14:57:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T419961)', diff saved to https://phabricator.wikimedia.org/P92544 and previous config saved to /var/cache/conftool/dbconfig/20260514-145715-fceratto.json
[14:57:19] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1288.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:58:49] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:59:40] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[14:59:50] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage
[14:59:51] <wikibugs>	 (03CR) 10Bking: [C:03+1] services_proxy: enabling toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[15:00:04] <jouncebot>	 andre and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1500).
[15:00:44] <andre>	 jouncebot: well, European Train log triage was six hours ago according to the Google calendar
[15:01:43] <wikibugs>	 (03CR) 10Bking: [C:03+1] "Feel free to add `opensearch-ttmserver` and `opensearch-toolhub` as well, not just the `-test` versions." [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[15:04:02] <wikibugs>	 (03CR) 10Thcipriani: "I think this is the problem you're seeing in devtools deploying phab to an upgraded host. Without a puppetserver there, it's hard to verif" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani)
[15:04:24] <wikibugs>	 (03CR) 10Ottomata: "Hm, this could be an issue. Will every message in the topic match the ECS schema?  If not, we'll have to filter the topic for the specific" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[15:05:40] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney)
[15:07:28] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1285.eqiad.wmnet with OS bookworm
[15:07:43] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922056 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1285.eqiad.wmnet with OS bookworm
[15:08:05] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:08:13] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1284.eqiad.wmnet with reason: host reimage
[15:08:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bearloga@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) (owner: 10Bearloga)
[15:08:43] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] ml-services: add qwen36-27b to experimental ns (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos)
[15:09:00] <wikibugs>	 (03CR) 10BryanDavis: "Cause of https://phabricator.wikimedia.org/T425687 "No Puppet resources found on instance deployment-mx04 on project deployment-prep"" [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) (owner: 10Muehlenhoff)
[15:09:51] <jinxer-wm>	 RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:09:57] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: fix product_metrics.web_base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) (owner: 10Bearloga)
[15:10:14] <logmsgbot>	 !log bearloga@deploy1003 Started scap sync-world: Backport for [[gerrit:1287422|EventStreamConfig: fix product_metrics.web_base (T426209)]]
[15:10:17] <stashbot>	 T426209: Explicitly declare absence of contextual attributes in product_metrics.web_base stream - https://phabricator.wikimedia.org/T426209
[15:12:01] <logmsgbot>	 !log bearloga@deploy1003 bearloga: Backport for [[gerrit:1287422|EventStreamConfig: fix product_metrics.web_base (T426209)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:12:07] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:12:21] <logmsgbot>	 !log bearloga@deploy1003 bearloga: Continuing with deployment
[15:14:43] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1284.eqiad.wmnet with reason: host reimage
[15:14:58] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] services_proxy: enabling toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[15:15:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11922147 (10Jhancock.wm)
[15:16:13] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1288.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:16:34] <logmsgbot>	 !log bearloga@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287422|EventStreamConfig: fix product_metrics.web_base (T426209)]] (duration: 06m 20s)
[15:16:37] <stashbot>	 T426209: Explicitly declare absence of contextual attributes in product_metrics.web_base stream - https://phabricator.wikimedia.org/T426209
[15:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:58] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:58] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:18:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:19:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:47] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:48] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:48] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:19:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:20:07] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney)
[15:22:13] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] .gitignore: Add /static/hcaptcha/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287026 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy)
[15:23:14] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1285.eqiad.wmnet with reason: host reimage
[15:25:17] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie
[15:29:26] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1285.eqiad.wmnet with reason: host reimage
[15:29:54] <wikibugs>	 (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Allow launching dumps from airflow-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis)
[15:31:51] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Allow launching dumps from airflow-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis)
[15:31:56] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[15:32:38] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[15:32:39] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1284.eqiad.wmnet with OS bookworm
[15:32:44] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1284.eqiad.wmnet with OS bookworm completed: - db1284 (**PASS**)   -...
[15:33:36] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1288.eqiad.wmnet with OS bookworm
[15:33:42] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1288.eqiad.wmnet with OS bookworm
[15:35:16] <wikibugs>	 (03PS1) 10Cathal Mooney: Release v0.11.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287436
[15:35:24] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[15:37:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394#11922256 (10bd808)
[15:40:43] <wikibugs>	 (03PS2) 10Cathal Mooney: Release v0.11.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287436
[15:41:11] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1289] - vriley@cumin1003"
[15:41:17] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1289] - vriley@cumin1003"
[15:41:17] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:41:30] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1289
[15:42:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394#11922304 (10bd808)
[15:42:51] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1289
[15:45:37] <wikibugs>	 10SRE-SLO, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Grafana: deploy grafana-dashboard-reporter-app - https://phabricator.wikimedia.org/T425795#11922323 (10hnowlan)
[15:45:51] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[15:45:56] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:46:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Release v0.11.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287436 (owner: 10Cathal Mooney)
[15:46:47] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie
[15:47:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11922328 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS tr...
[15:48:53] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[15:48:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1285.eqiad.wmnet with OS bookworm
[15:49:03] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1285.eqiad.wmnet with OS bookworm completed: - db1285 (**PASS**)   -...
[15:49:27] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Release v0.11.2 - cmooney@cumin1003
[15:49:37] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1288.eqiad.wmnet with reason: host reimage
[15:50:05] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[15:51:07] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Release v0.11.2 - cmooney@cumin1003
[15:52:21] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:53:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:54:49] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1288.eqiad.wmnet with reason: host reimage
[15:54:57] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1290] - vriley@cumin1003"
[15:55:03] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1290] - vriley@cumin1003"
[15:55:03] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:55:22] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1290
[15:56:42] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1290
[15:57:43] <wikibugs>	 (03CR) 10Btullis: "Yes, I agree. I think that this will be an issue." [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[15:57:47] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:59:12] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:59:16] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:59:18] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[16:00:05] <jouncebot>	 jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1600).
[16:00:05] <jouncebot>	 Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:33] <rzl>	 o/ looking
[16:01:40] <rzl>	 Dreamy_Jazz: pretty foolproof from a puppet pov :) I'm not reviewing from a "maintenance script does the right thing" perspective but I figure you did
[16:01:49] <rzl>	 will you want to kick off a test run, or just let the next one happen on schedule?
[16:01:55] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae)
[16:03:59] <wikibugs>	 10SRE-swift-storage, 10Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975#11922386 (10taavi) 05Open→03Resolved a:03taavi
[16:04:05] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[16:06:45] <Dreamy_Jazz>	 \o
[16:06:51] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[16:06:52] <Dreamy_Jazz>	 Sorry, didn't see pings until now
[16:06:57] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[16:07:08] <Dreamy_Jazz>	 Should be fine for the next one to happen on schedule
[16:07:17] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove records for deleted IPs esams,drmrs and magru - cmooney@cumin1003"
[16:07:29] <Dreamy_Jazz>	 And yes, I reviewed from a "maintenance script does the right thing" perspective
[16:07:31] <jinxer-wm>	 FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[16:07:34] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove INCLUDE statements for CR<->CR link networks no longer used [dns] - 10https://gerrit.wikimedia.org/r/1287439 (https://phabricator.wikimedia.org/T424611)
[16:07:38] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove records for deleted IPs esams,drmrs and magru - cmooney@cumin1003"
[16:07:38] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:07:50] <rzl>	 Dreamy_Jazz: great thanks :)
[16:08:10] <rzl>	 mw-cron had some existing undeployed diffs, just checking those before I push this out
[16:09:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:11:32] <rzl>	 aha, the diffs are https://gerrit.wikimedia.org/r/c/operations/puppet/+/1280431
[16:11:33] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[16:12:31] <jinxer-wm>	 RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[16:12:48] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:13:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:13:55] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1288.eqiad.wmnet with OS bookworm
[16:14:04] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:14:05] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922442 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1288.eqiad.wmnet with OS bookworm completed: - db1288 (**PASS**)   -...
[16:14:10] <wikibugs>	 (03PS1) 10Cathal Mooney: common.yaml: remove OSPF definitions for esams/drmrs/magru cr links [homer/public] - 10https://gerrit.wikimedia.org/r/1287440 (https://phabricator.wikimedia.org/T424611)
[16:15:38] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:15:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] common.yaml: remove OSPF definitions for esams/drmrs/magru cr links [homer/public] - 10https://gerrit.wikimedia.org/r/1287440 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney)
[16:16:06] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:16:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:17:19] <wikibugs>	 (03Merged) 10jenkins-bot: common.yaml: remove OSPF definitions for esams/drmrs/magru cr links [homer/public] - 10https://gerrit.wikimedia.org/r/1287440 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney)
[16:17:51] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[16:18:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:18:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:18:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:18:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:18:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:19:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:19:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[16:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:47] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:47] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:47] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 5.756 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:19:57] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[16:19:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift
[16:20:32] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[16:20:51] <rzl>	 (no cronjobs in codfw but I'm deploying just to clean up the unapplied diffs in the networkpolicy)
[16:20:51] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922517 (10VRiley-WMF)
[16:20:56] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[16:21:00] <wikibugs>	 (03CR) 10AKhatun: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton)
[16:21:14] <topranks>	 !log disable core router direct link at magru now that traffic is flowing via switches T424611
[16:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:18] <stashbot>	 T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[16:21:30] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922522 (10VRiley-WMF)
[16:21:45] <rzl>	 Dreamy_Jazz: done!
[16:21:49] <rzl>	 puppet window complete
[16:21:51] <Dreamy_Jazz>	 Thanks
[16:22:21] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922528 (10VRiley-WMF) Looking into the other servers to see what issues they may be. Suspected wrong cable ports, cable issues or mislabeled
[16:22:29] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922533 (10VRiley-WMF) 05In progress→03Open
[16:25:11] <topranks>	 !log disable core router direct link at drmrs now that traffic is flowing via switches T424611
[16:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:46] <topranks>	 !log disable core router direct link at esams now that traffic is flowing via switches T424611
[16:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:50] <stashbot>	 T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[16:33:20] <wikibugs>	 (03CR) 10Dzahn: "Thank you!  there is actually a puppetserver there. but I was hoping we could stop using it and switch phab test instances back to the glo" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani)
[16:34:23] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:35:24] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton)
[16:35:50] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1006.eqiad.wmnet with OS trixie
[16:36:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11922644 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS trixie...
[16:36:08] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[16:36:12] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[16:37:16] <wikibugs>	 (03CR) 10Dzahn: "puppet compiler seems broken:  Failed to execute '/pdb/query/v4' on at least 1 of the following 'server_urls': https://pcc-worker1006.pupp" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani)
[16:44:14] <wikibugs>	 (03CR) 10Anzx: "seems ok to keep it as it is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah)
[16:44:17] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Phabricator: require config before scap [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani)
[16:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:48:38] <wikibugs>	 (03PS1) 10Dzahn: phabricator::migration: require config before scap [puppet] - 10https://gerrit.wikimedia.org/r/1287447 (https://phabricator.wikimedia.org/T424055)
[16:49:11] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[16:49:16] <logmsgbot>	 !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[16:49:17] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "doing the same thing in the phabricator::migration class which we made to allow setting up new prod phabricator servers:  https://gerrit.w" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani)
[16:53:07] <wikibugs>	 (03CR) 10Ottomata: "Hm, nice!  Could we do that to produce it to both $dc.$meta.stream and to rsyslog-$severity?  Doing so would get the data in logstash as w" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[16:53:10] <wikibugs>	 (03CR) 10Dzahn: [C:04-2] "probably breaks puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1287447 (https://phabricator.wikimedia.org/T424055) (owner: 10Dzahn)
[16:53:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "unfortunately: nope.  dependency cycle." [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani)
[16:53:59] <wikibugs>	 (03PS1) 10Dzahn: Revert "Phabricator: require config before scap" [puppet] - 10https://gerrit.wikimedia.org/r/1287450
[16:56:34] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "Phabricator: require config before scap" [puppet] - 10https://gerrit.wikimedia.org/r/1287450 (owner: 10Dzahn)
[16:57:52] <wikibugs>	 (03CR) 10Ottomata: "I was looking for prior art here.  I recall that the `mediawiki.client.error` stream is produced to kafka logging clusters by eventgate-lo" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis)
[16:58:06] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T426298
[17:00:05] <jouncebot>	 bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1700)
[17:03:41] <jinxer-wm>	 FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:44] * bd808 checks for deployable increments
[17:06:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343 (10cmooney) 03NEW p:05Triage→03Medium
[17:06:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11922831 (10cmooney)
[17:06:58] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T426298
[17:08:19] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-05-11-122319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287452
[17:08:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove INCLUDE statements for CR<->CR link networks no longer used [dns] - 10https://gerrit.wikimedia.org/r/1287439 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney)
[17:09:11] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[17:10:27] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[17:11:17] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-05-11-122319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287452 (owner: 10BryanDavis)
[17:13:30] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-05-11-122319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287452 (owner: 10BryanDavis)
[17:14:01] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T426298
[17:14:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:14:21] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:15:58] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:16:13] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:16:37] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:16:53] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:17:03] <wikibugs>	 06SRE, 10corto, 10Incident Tooling, 13Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11922867 (10hnowlan) Some initial context: The kinds of issues SRE are dealing with have changed significantly in the last ~year. Historicall...
[17:17:26] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:17:47] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:17:48] <wikibugs>	 (03PS2) 10Dzahn: zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999
[17:18:52] <bd808>	 That's all for my window.</window>
[17:19:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[17:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[17:19:50] <wikibugs>	 (03CR) 10Dzahn: "Currently the user has a home dir /home/zuul but it does not exist:" [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (owner: 10Dzahn)
[17:19:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Swift
[17:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[17:21:11] <wikibugs>	 (03CR) 10Dzahn: "we should talk to traffic about the port question" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn)
[17:23:03] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T426298
[17:24:34] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1285488/8560/codesearch9.codesearch.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[17:25:42] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "let me know if you think the general idea is good/acceptable. then I will be bold to just merge it, no need to check the code details." [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[17:26:18] <wikibugs>	 (03PS6) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[17:26:23] <wikibugs>	 (03PS7) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[17:26:39] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "for the future person seeing this: please do not punish me for having touched it last. just trying to be helpful with one particular incid" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn)
[17:27:30] <wikibugs>	 (03PS8) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[17:27:30] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[17:28:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11922906 (10Dzahn) 05In progress→03Stalled
[17:28:48] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[17:28:54] <wikibugs>	 (03Abandoned) 10Dzahn: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: 10Dzahn)
[17:29:53] <wikibugs>	 (03CR) 10Dzahn: "stalled - waiting for feedback - moving back to WIP status" [puppet] - 10https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn)
[17:30:58] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[17:31:14] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[17:32:06] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[17:38:40] <wikibugs>	 (03CR) 10BPirkle: "Ack" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis)
[18:00:05] <jouncebot>	 andre and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1800).
[18:08:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:09:04] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1274.eqiad.wmnet with OS bookworm
[18:09:11] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm
[18:10:43] <wikibugs>	 (03PS5) 10Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: 10HakanIST)
[18:14:08] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[18:16:48] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:17:22] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:19:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[18:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[18:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[18:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[18:19:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:22:13] <logmsgbot>	 vriley@cumin1003 provision (PID 4001665) is awaiting input
[18:23:30] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:24:38] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[18:25:02] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1274.eqiad.wmnet with reason: host reimage
[18:25:59] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457
[18:26:28] <wikibugs>	 (03PS1) 10Jdlrobson: Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328)
[18:27:54] <wikibugs>	 (03PS2) 10Andrew Bogott: magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457
[18:29:37] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1274.eqiad.wmnet with reason: host reimage
[18:29:55] <wikibugs>	 (03CR) 10Andrew Bogott: [V:03+2] magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457 (owner: 10Andrew Bogott)
[18:30:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457 (owner: 10Andrew Bogott)
[18:32:32] <logmsgbot>	 vriley@cumin1003 provision (PID 4001665) is awaiting input
[18:33:53] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923052 (10VRiley-WMF)
[18:36:53] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11923056 (10Ottomata)
[18:38:20] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11923058 (10Ahoelzl) approved.
[18:40:48] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:46:50] <wikibugs>	 10SRE-SLO, 10observability, 10Wikidata, 06Wikidata Platform Team, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11923094 (10bking)
[18:47:05] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[18:50:10] <logmsgbot>	 vriley@cumin1003 reimage (PID 4001255) is awaiting input
[18:51:03] <wikibugs>	 (03CR) 10RLazarus: wikifunctions: Add releases function-evaluators in Rust, unused (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[18:51:44] <wikibugs>	 (03PS1) 10RLazarus: wikifunctions: Remove noop OTEL_EXPORTER_OTLP_ENDPOINT from releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287465 (https://phabricator.wikimedia.org/T423627)
[18:53:57] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Remove noop OTEL_EXPORTER_OTLP_ENDPOINT from releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287465 (https://phabricator.wikimedia.org/T423627) (owner: 10RLazarus)
[18:56:10] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Remove noop OTEL_EXPORTER_OTLP_ENDPOINT from releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287465 (https://phabricator.wikimedia.org/T423627) (owner: 10RLazarus)
[18:57:57] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:58:00] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:58:59] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Add releases function-evaluators in Rust, unused (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg)
[19:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:06:50] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1281.eqiad.wmnet with OS bookworm
[19:06:58] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923131 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1281.eqiad.wmnet with OS bookworm
[19:07:16] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos)
[19:07:56] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923136 (10VRiley-WMF)
[19:09:22] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos)
[19:14:58] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[19:14:59] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1274.eqiad.wmnet with OS bookworm
[19:15:10] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923140 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm completed: - db1274 (**PASS**)   -...
[19:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[19:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[19:19:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[19:19:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[19:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[19:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[19:20:47] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift
[19:20:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[19:22:30] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[19:23:07] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1281.eqiad.wmnet with reason: host reimage
[19:24:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah)
[19:26:25] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1286] - vriley@cumin1003"
[19:26:31] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [db1286] - vriley@cumin1003"
[19:26:31] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:26:46] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1286
[19:28:03] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1286
[19:28:34] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1286.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:29:07] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1281.eqiad.wmnet with reason: host reimage
[19:38:13] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:45:47] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[19:46:22] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1286.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:48:52] <logmsgbot>	 vriley@cumin1003 reimage (PID 4009498) is awaiting input
[19:49:13] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[19:49:14] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1281.eqiad.wmnet with OS bookworm
[19:49:21] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923219 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1281.eqiad.wmnet with OS bookworm completed: - db1281 (**PASS**)   -...
[19:53:23] <wikibugs>	 (03CR) 10Bking: [C:03+1] IPReputation: Route opensearch_ipoid through envoy service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286804 (https://phabricator.wikimedia.org/T421293) (owner: 10Kosta Harlan)
[19:56:16] <godog>	 9
[19:56:48] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T2000).
[20:00:05] <jouncebot>	 JSherman, stephanebisson, codenamenoreste, and Neriah: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:26] <JSherman>	 o/ ready to go and can self deploy if needed
[20:01:14] <Neriah>	 hi :)
[20:02:24] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1286.eqiad.wmnet with OS bookworm
[20:02:33] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923248 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1286.eqiad.wmnet with OS bookworm
[20:03:19] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[20:04:15] <JSherman>	 It's looking like all config patches today.  I was planning on rolling my 3 together as they are pretty straightforward. Does anybody want to hitch a ride on that deploy?
[20:05:17] <cjming>	 ^^
[20:05:41] <cjming>	 for anyone who misses the ride, i can deploy for whoever needs - just ping
[20:06:55] <JSherman>	 cjming: thanks! I'm showing 5 after now, so I'll get started.
[20:07:07] <cjming>	 sounds good
[20:07:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: 10Kgraessle)
[20:07:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) (owner: 10Jsn.sherman)
[20:07:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) (owner: 10Jsn.sherman)
[20:08:14] <wikibugs>	 (03Merged) 10jenkins-bot: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: 10Kgraessle)
[20:08:18] <wikibugs>	 (03Merged) 10jenkins-bot: Enable AutoModerator on Albanian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) (owner: 10Jsn.sherman)
[20:08:22] <wikibugs>	 (03Merged) 10jenkins-bot: Enable AutoModerator on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) (owner: 10Jsn.sherman)
[20:10:33] <JSherman>	 looks like the deploy timed out on the rebase, which succeeded; retrying
[20:11:08] <logmsgbot>	 !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1192921|Enable AutoModerator on Italian Wikipedia (T405152)]], [[gerrit:1286974|Enable AutoModerator on Albanian Wikipedia (T420450)]], [[gerrit:1286975|Enable AutoModerator on Dutch Wikipedia (T425509)]]
[20:11:15] <stashbot>	 T405152: Enable AutoModerator on Italian Wikipedia - https://phabricator.wikimedia.org/T405152
[20:11:16] <stashbot>	 T420450: Enable AutoModerator on Albanian Wikipedia - https://phabricator.wikimedia.org/T420450
[20:11:16] <stashbot>	 T425509: Enable AutoModerator on Dutch Wikipedia (nlwiki) - https://phabricator.wikimedia.org/T425509
[20:11:46] <JSherman>	 love to see it: `0 languages rebuilt out of 549`
[20:12:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923273 (10Papaul)
[20:13:03] <logmsgbot>	 !log jsn@deploy1003 kgraessle, jsn: Backport for [[gerrit:1192921|Enable AutoModerator on Italian Wikipedia (T405152)]], [[gerrit:1286974|Enable AutoModerator on Albanian Wikipedia (T420450)]], [[gerrit:1286975|Enable AutoModerator on Dutch Wikipedia (T425509)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:13:33] <JSherman>	 testing
[20:14:39] <logmsgbot>	 !log jsn@deploy1003 kgraessle, jsn: Continuing with deployment
[20:18:17] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1286.eqiad.wmnet with reason: host reimage
[20:18:57] <logmsgbot>	 !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1192921|Enable AutoModerator on Italian Wikipedia (T405152)]], [[gerrit:1286974|Enable AutoModerator on Albanian Wikipedia (T420450)]], [[gerrit:1286975|Enable AutoModerator on Dutch Wikipedia (T425509)]] (duration: 07m 48s)
[20:19:04] <stashbot>	 T405152: Enable AutoModerator on Italian Wikipedia - https://phabricator.wikimedia.org/T405152
[20:19:04] <stashbot>	 T420450: Enable AutoModerator on Albanian Wikipedia - https://phabricator.wikimedia.org/T420450
[20:19:04] <stashbot>	 T425509: Enable AutoModerator on Dutch Wikipedia (nlwiki) - https://phabricator.wikimedia.org/T425509
[20:19:05] <JSherman>	 cjming: all yours!
[20:19:21] <cjming>	 JSherman: thanks!
[20:19:40] <dancy>	 JSerman: Coming soon: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1187
[20:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[20:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[20:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[20:19:46] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1287.eqiad.wmnet with OS bookworm
[20:19:52] <dancy>	 oops: JSherman: ^^
[20:19:52] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1287.eqiad.wmnet with OS bookworm
[20:19:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[20:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[20:20:07] <cjming>	 stephanebisson: are you around?
[20:20:25] <cjming>	 codenamenoreste: are you around?
[20:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:20:48] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923292 (10VRiley-WMF)
[20:20:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:20:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:21:07] <cjming>	 Neriah: I think you're here?
[20:21:13] <Neriah>	 ya
[20:21:21] <cjming>	 ok i'll do yours next
[20:21:26] <JSherman>	 dancy: 🎉that's awesome!🎉
[20:21:59] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923293 (10VRiley-WMF)
[20:22:01] <wikibugs>	 (03PS3) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206)
[20:22:31] <wikibugs>	 (03CR) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah)
[20:23:27] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1286.eqiad.wmnet with reason: host reimage
[20:24:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah)
[20:24:58] <wikibugs>	 (03Merged) 10jenkins-bot: Disable wgNewUserMessageOnAutoCreate on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah)
[20:25:15] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1287002|Disable wgNewUserMessageOnAutoCreate on all WMF wikis (T426206)]]
[20:25:19] <stashbot>	 T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206
[20:27:05] <logmsgbot>	 !log cjming@deploy1003 cjming, neriah: Backport for [[gerrit:1287002|Disable wgNewUserMessageOnAutoCreate on all WMF wikis (T426206)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:27:14] <Neriah>	 testing
[20:28:52] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1289.eqiad.wmnet with OS bookworm
[20:29:06] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923323 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1289.eqiad.wmnet with OS bookworm
[20:29:10] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1289.eqiad.wmnet with OS bookworm
[20:29:16] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923324 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1289.eqiad.wmnet with OS bookworm executed with errors: - db1289 (**F...
[20:29:46] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:31:05] <Neriah>	 looks good
[20:31:17] <Neriah>	 cjming: you can continue
[20:31:26] <logmsgbot>	 !log cjming@deploy1003 cjming, neriah: Continuing with deployment
[20:31:33] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:35:33] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287002|Disable wgNewUserMessageOnAutoCreate on all WMF wikis (T426206)]] (duration: 10m 18s)
[20:35:37] <stashbot>	 T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206
[20:35:39] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1287.eqiad.wmnet with reason: host reimage
[20:35:52] <Neriah>	 thanks :)
[20:35:54] <cjming>	 yw!
[20:36:07] <cjming>	 if anyone else shows up for the window and needs a deployer, please ping me
[20:38:46] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1287.eqiad.wmnet with reason: host reimage
[20:39:17] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: add qwen36-27b to experimental ns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287480
[20:40:27] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] Revert "ml-services: add qwen36-27b to experimental ns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287480 (owner: 10Ilias Sarantopoulos)
[20:40:34] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[20:41:12] <stephanebisson>	 Where are we in the backport window?
[20:41:44] <cjming>	 stephanebisson: do you need a deployer for your patch?
[20:42:04] <stephanebisson>	 I can do it, is it my turn?
[20:42:04] <cjming>	 you can self-deploy or i'm happy to deploy for you
[20:42:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923368 (10Papaul)
[20:42:10] <cjming>	 sure - go for it
[20:42:13] <stephanebisson>	 Thanks
[20:42:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson)
[20:42:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ml-services: add qwen36-27b to experimental ns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287480 (owner: 10Ilias Sarantopoulos)
[20:43:15] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[20:43:28] <wikibugs>	 (03Merged) 10jenkins-bot: Simplewiki: include article wizard in AG experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson)
[20:43:39] <logmsgbot>	 vriley@cumin1003 reimage (PID 4018159) is awaiting input
[20:43:41] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287427|Simplewiki: include article wizard in AG experiment (T426278)]]
[20:43:43] <wikibugs>	 (PS1) Bking: cirrussearch: Add server depool metadata [puppet] - https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300)
[20:43:45] <stashbot>	 T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278
[20:45:30] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1287427|Simplewiki: include article wizard in AG experiment (T426278)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:46:35] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Continuing with deployment
[20:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:50:44] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287427|Simplewiki: include article wizard in AG experiment (T426278)]] (duration: 07m 03s)
[20:50:47] <stashbot>	 T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278
[20:52:02] <stephanebisson>	 I'm done
[20:54:06] <wikibugs>	 (CR) Dzahn: [C:-1] "for now let's stick to the simple control of each service being present or absent in Hiera, per DC (or per host)" [puppet] - https://gerrit.wikimedia.org/r/1287035 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[20:54:38] <wikibugs>	 (PS3) Seddon: Enable hCaptcha for account creation API on group 0 wiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043)
[20:55:56] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[20:56:24] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[20:56:25] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1287.eqiad.wmnet with OS bookworm
[20:56:37] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923405 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1287.eqiad.wmnet with OS bookworm completed: - db1287 (**PASS**)   -...
[20:57:26] <wikibugs>	 (PS2) Ryan Kemper: cirrussearch: Add server depool metadata [puppet] - https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) (owner: Bking)
[20:57:46] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) (owner: Bking)
[21:00:04] <wikibugs>	 (PS1) Dzahn: zuul: disable all services in codfw, keep enabled in eqiad [puppet] - https://gerrit.wikimedia.org/r/1287483 (https://phabricator.wikimedia.org/T395938)
[21:00:05] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T2100)
[21:02:12] <wikibugs>	 (Abandoned) Dzahn: zuul: make all service_ensures dependent on a single active server [puppet] - https://gerrit.wikimedia.org/r/1287035 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[21:02:56] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] Enable hCaptcha for account creation API on group 0 wiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) (owner: Seddon)
[21:03:41] <jinxer-wm>	 FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:17] <Dreamy_Jazz>	 jouncebot: nowandnext
[21:04:18] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T2100)
[21:04:18] <jouncebot>	 In 8 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260515T0600)
[21:04:29] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1287483/8561/" [puppet] - https://gerrit.wikimedia.org/r/1287483 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[21:06:54] <wikibugs>	 (PS1) Dreamy Jazz: Remove DynamicPageList from legalteamwiki as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1287484
[21:07:35] <Dreamy_Jazz>	 Going to use scap shortly
[21:08:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) (owner: Seddon)
[21:08:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287484 (owner: Dreamy Jazz)
[21:09:43] <wikibugs>	 (Merged) jenkins-bot: Enable hCaptcha for account creation API on group 0 wiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) (owner: Seddon)
[21:09:46] <wikibugs>	 (Merged) jenkins-bot: Remove DynamicPageList from legalteamwiki as unused [mediawiki-config] - https://gerrit.wikimedia.org/r/1287484 (owner: Dreamy Jazz)
[21:10:02] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1287479|Enable hCaptcha for account creation API on group 0 wiki's]], [[gerrit:1287484|Remove DynamicPageList from legalteamwiki as unused]]
[21:11:49] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, seddon: Backport for [[gerrit:1287479|Enable hCaptcha for account creation API on group 0 wiki's]], [[gerrit:1287484|Remove DynamicPageList from legalteamwiki as unused]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:12:26] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, seddon: Continuing with deployment
[21:12:57] <wikibugs>	 (PS1) Jdrewniak: Disable Reading Lists survey for Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1287485
[21:15:08] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[21:15:09] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1286.eqiad.wmnet with OS bookworm
[21:15:17] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923442 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1286.eqiad.wmnet with OS bookworm completed: - db1286 (**WARN**)   -...
[21:15:27] <wikibugs>	 (PS2) Jdrewniak: Disable Reading Lists survey for Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776)
[21:16:35] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287479|Enable hCaptcha for account creation API on group 0 wiki's]], [[gerrit:1287484|Remove DynamicPageList from legalteamwiki as unused]] (duration: 06m 33s)
[21:16:42] <Dreamy_Jazz>	 Finished with scap
[21:17:37] <wikibugs>	 (CR) Anne Tomasevich: [C:+1] Disable Reading Lists survey for Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: Jdrewniak)
[21:17:50] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: Add server depool metadata [puppet] - https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) (owner: Bking)
[21:18:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:19:41] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: Jdrewniak)
[21:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:19:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:19:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:20:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:20:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:20:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:20:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:23:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: Jdrewniak)
[21:24:19] <wikibugs>	 (Merged) jenkins-bot: Disable Reading Lists survey for Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: Jdrewniak)
[21:24:33] <logmsgbot>	 !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1287485|Disable Reading Lists survey for Wikipedias (T421776)]]
[21:24:37] <stashbot>	 T421776: Enable the beta feature survey - https://phabricator.wikimedia.org/T421776
[21:26:21] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1287485|Disable Reading Lists survey for Wikipedias (T421776)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:28:05] <EricGardner>	 When Jan is finished, I have one final readers patch during today's window that I plan to backport
[21:28:30] <jinxer-wm>	 RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[21:29:38] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Continuing with deployment
[21:33:48] <logmsgbot>	 !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287485|Disable Reading Lists survey for Wikipedias (T421776)]] (duration: 09m 15s)
[21:33:52] <stashbot>	 T421776: Enable the beta feature survey - https://phabricator.wikimedia.org/T421776
[21:34:43] <wikibugs>	 (CR) Cwhite: Configure nginx to log requests in ECS format to syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[21:35:14] <wikibugs>	 (PS1) Eric Gardner: Share Highlight: overdraw photo on share card canvas [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1287488 (https://phabricator.wikimedia.org/T426344)
[21:38:16] <wikibugs>	 (CR) Cwhite: [C:+1] "Recommend holding until the nginx config change is merged and verifying the nginx output logs' ECS compatibility with https://doc.wikimedi" [puppet] - https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: Btullis)
[21:38:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1287488 (https://phabricator.wikimedia.org/T426344) (owner: Eric Gardner)
[21:38:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:39:49] <wikibugs>	 (Merged) jenkins-bot: Share Highlight: overdraw photo on share card canvas [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1287488 (https://phabricator.wikimedia.org/T426344) (owner: Eric Gardner)
[21:40:05] <logmsgbot>	 !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1287488|Share Highlight: overdraw photo on share card canvas (T426344)]]
[21:40:09] <stashbot>	 T426344: [Share Highlights] Image is not showing in the Share card on certain clients - https://phabricator.wikimedia.org/T426344
[21:41:51] <logmsgbot>	 !log egardner@deploy1003 egardner: Backport for [[gerrit:1287488|Share Highlight: overdraw photo on share card canvas (T426344)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:43:11] <logmsgbot>	 !log egardner@deploy1003 egardner: Continuing with deployment
[21:47:19] <logmsgbot>	 !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287488|Share Highlight: overdraw photo on share card canvas (T426344)]] (duration: 07m 14s)
[21:47:23] <stashbot>	 T426344: [Share Highlights] Image is not showing in the Share card on certain clients - https://phabricator.wikimedia.org/T426344
[21:53:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:11:40] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] zuul: disable all services in codfw, keep enabled in eqiad [puppet] - https://gerrit.wikimedia.org/r/1287483 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[22:13:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:19:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:19:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:52] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:24:38] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[22:33:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:45:43] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923791 (Papaul)
[22:46:35] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923792 (Papaul) Open→Resolved This is complete. thanks to @Jhancock.wm and @Jgreen
[22:48:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:49:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:55:54] <wikibugs>	 SRE, corto, Incident Tooling, Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11923800 (Novem_Linguae) Thanks for working on this and for the quick and thorough replies. I appreciate it.  > For the most part, the wiki...
[23:02:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:07:17] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[23:09:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:10:05] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:11:15] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1289
[23:12:52] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1289
[23:13:34] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:14:45] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:19:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:20:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:24:19] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:26:18] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:27:28] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:30:01] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:31:35] <wikibugs>	 (PS1) Thcipriani: phabricator::migration: require config before scap [puppet] - https://gerrit.wikimedia.org/r/1287447 (https://phabricator.wikimedia.org/T424055) (owner: Dzahn)
[23:33:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:34:49] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:38:55] <logmsgbot>	 vriley@cumin1003 provision (PID 4044236) is awaiting input
[23:39:08] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:40:03] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287498
[23:40:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287498 (owner: TrainBranchBot)
[23:49:49] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:51:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287498 (owner: TrainBranchBot)
[23:53:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:53:51] <logmsgbot>	 vriley@cumin1003 provision (PID 4044814) is awaiting input
[23:54:04] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:55:48] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1290
[23:57:00] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1290
[23:57:24] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:58:55] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:59:21] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED