[00:05:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:05:58] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:07:23] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [00:08:26] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:35] (03PS1) 10Sbisson: Enable the Article Guidance experiment on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) [00:16:41] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson) [00:19:03] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:37:49] FIRING: DiskSpace: Disk space build2001:9100:/ 1.43% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - 
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:10:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1287044 [01:10:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1287044 (owner: 10TrainBranchBot) [01:21:12] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1287044 (owner: 10TrainBranchBot) [02:00:37] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:26] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 49s) [02:09:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:28] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:31:30] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:46:26] 
FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11920464 (10Papaul) [04:15:53] (03CR) 10WAN233: change logo at zh-classical wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233) [04:19:04] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:37:49] FIRING: DiskSpace: Disk space build2001:9100:/ 1.429% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:56:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:28] !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 5:00:00 on 13 hosts with reason: Sanitarium s2 master: 
reimage to Debian Trixie [05:04:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: Sanitarium s7 master: reimage to Debian Trixie [05:05:26] (03PS1) 10Marostegui: db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287080 (https://phabricator.wikimedia.org/T425388) [05:05:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Reimage to Trixie [05:05:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1158: Reimage to Trixie [05:06:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1158: Reimage to Trixie [05:10:02] marostegui@cumin1003 reimage (PID 3741973) is awaiting input [05:12:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1158.eqiad.wmnet with OS trixie [05:12:47] (03CR) 10Marostegui: [C:03+2] db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287080 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [05:25:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage [05:29:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage [05:38:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:38:39] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [05:38:43] (03PS1) 10Marostegui: Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1287248 [05:40:14] (03CR) 10Marostegui: [C:03+2] Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1287248 (owner: 10Marostegui) [05:41:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:44:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:46:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:46:39] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:50:44] PROBLEM - SSH on an-worker1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:51:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1158.eqiad.wmnet with OS trixie [05:51:34] RECOVERY - SSH on an-worker1200 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:54:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1158: after reimage to trixie [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0600). 
[06:33:51] Deploying cxserver.. [06:35:15] (03PS1) 10Abijeet Patro: ULS rewrite: Enable on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 [06:39:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1158: after reimage to trixie [06:39:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2013: Replacing HW T418973 [06:39:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:39:36] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [06:39:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:39:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2013: Replacing HW T418973 [06:40:00] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:40:35] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:41:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2013,2023].codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance on pc3 [06:42:14] (03PS1) 10Marostegui: mariadb: Productionize pc2023 [puppet] - 10https://gerrit.wikimedia.org/r/1287289 (https://phabricator.wikimedia.org/T418973) [06:43:05] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc2023 [puppet] - 10https://gerrit.wikimedia.org/r/1287289 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [07:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0700). [07:00:04] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:00:39] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [07:01:34] !log Update cxserver to 2026-04-23-114216-production (T423002) [07:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:37] T423002: Migrate cxserver in production to node24 - https://phabricator.wikimedia.org/T423002 [07:07:15] (03PS1) 10Ryan Kemper: hadoop.reboot-workers: drop custom --dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/1287290 (https://phabricator.wikimedia.org/T411568) [07:07:47] (03PS8) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1.This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - [...]-centralauth-production.sh to sqoop the centralauth production tables. - [...]-mediawiki-clouddb.sh to sqoop the cloudb tables. - [...]-mediawiki-production.sh to sqoop production replicas tabl [07:07:47] 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [07:08:27] (03CR) 10Ryan Kemper: [C:03+2] hadoop.reboot-workers: make host override smarter (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: 10Ryan Kemper) [07:09:43] (03CR) 10CI reject: [V:04-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1.This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - [...]-centralauth-production.sh to sqoop the centralauth production tables. - [...]-mediawiki-clouddb.sh to sqoop the cloudb tables. 
- [...]-mediawiki-production.sh to sqoop production rep [07:09:43] 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [07:13:48] (03PS2) 10Ryan Kemper: airflow-test-k8s: add ldap-sync task-pod egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286750 (https://phabricator.wikimedia.org/T420691) [07:21:53] (03PS9) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [07:23:49] (03CR) 10CI reject: [V:04-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [07:25:31] (03PS3) 10Ryan Kemper: cirrussearch: install atop utility [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [07:25:42] (03CR) 10Ryan Kemper: "I added a guard so we won't have the PCC failure (would break puppet on cirrussearch afaict)" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [07:26:17] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [07:29:15] (03CR) 10Ryan Kemper: [C:03+1] "LGTM now; pcc's happy" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [07:29:45] (03PS10) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [07:31:42] (03CR) 10CI reject: [V:04-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1. 
[puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [07:34:35] (03PS11) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [07:49:02] (03PS1) 10Elukey: docker_registry: allow multiple docker instances [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) [08:00:05] andre and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T0800). [08:00:06] I will now start promoting group2 wikis to 1.47.0-wmf.2 [08:01:34] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287355 (https://phabricator.wikimedia.org/T423911) [08:01:36] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287355 (https://phabricator.wikimedia.org/T423911) (owner: 10TrainBranchBot) [08:02:43] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287355 (https://phabricator.wikimedia.org/T423911) (owner: 10TrainBranchBot) [08:04:55] (03PS7) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) [08:06:30] (03CR) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [08:06:36] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [08:06:53] jouncebot: next [08:06:53] In 1 hour(s) and 53 minute(s): 
MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1000) [08:08:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:49] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.2 refs T423911 [08:08:54] T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911 [08:10:06] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:10:06] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:12:35] RESOLVED: DiskSpace: Disk space build2001:9100:/ 1.435% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:19:04] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:21:30] the pull failed, gerrit issue? 
[08:34:22] (03CR) 10Tiziano Fogli: [C:03+2] thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) (owner: 10Tiziano Fogli) [08:37:11] (03PS1) 10Marostegui: instances.yaml: Remove db2149 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1287356 (https://phabricator.wikimedia.org/T424341) [08:38:08] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2149 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1287356 (https://phabricator.wikimedia.org/T424341) (owner: 10Marostegui) [08:39:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2149 T424341', diff saved to https://phabricator.wikimedia.org/P92520 and previous config saved to /var/cache/conftool/dbconfig/20260514-083916-marostegui.json [08:39:20] T424341: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341 [08:40:04] (03PS1) 10Marostegui: db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287357 (https://phabricator.wikimedia.org/T424341) [08:40:50] (03CR) 10Marostegui: [C:03+2] db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287357 (https://phabricator.wikimedia.org/T424341) (owner: 10Marostegui) [08:49:26] andre: how far down are you ? 
[08:50:10] effie: Done, go ahead [08:50:12] Seeing one spike but that's on a closed wiki and nothing to roll back for [08:50:22] grand thank you [08:51:38] (03CR) 10Effie Mouzeli: [C:03+2] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286875 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [08:53:17] (03CR) 10Atsuko: [C:03+2] opensearch-ttmserver: switch to opensearch 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286957 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:53:43] cumin2002: I think it is because there are local changes on the homer repo [08:53:47] (03Merged) 10jenkins-bot: ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286875 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [08:53:48] I will tell the netops [08:54:16] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host mc1065.eqiad.wmnet with OS bullseye [08:54:18] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1066.eqiad.wmnet with OS bullseye [08:54:20] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1067.eqiad.wmnet with OS bullseye [08:54:27] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1068.eqiad.wmnet with OS bullseye [08:55:13] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [08:55:33] (03Merged) 10jenkins-bot: opensearch-ttmserver: switch to opensearch 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286957 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:55:38] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [08:56:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:32] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [09:06:37] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage [09:06:49] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [09:06:51] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [09:07:31] (03CR) 10Btullis: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [09:10:38] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [09:11:24] (03CR) 10Btullis: [C:03+1] archiva: block scraper UAs at nginx [puppet] - 10https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: 10Ryan Kemper) [09:11:43] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1287361 (https://phabricator.wikimedia.org/T426291) [09:11:56] (03PS1) 10Ilias Sarantopoulos: (WIP)ml-services: add qwen36-27b to experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) [09:14:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [09:16:25] (03PS1) 10Marco Fossati: Scale share-highlight card to fit small viewports [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) [09:16:41] (03PS2) 10Effie Mouzeli: api-gateway: 
codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) [09:17:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) (owner: 10Marco Fossati) [09:17:16] PROBLEM - Memcached on mc1065 is CRITICAL: connect to address 10.64.177.8 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:17:16] PROBLEM - Memcached on mc1067 is CRITICAL: connect to address 10.64.183.11 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:18:39] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [09:19:04] (03CR) 10Effie Mouzeli: [C:03+2] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:20:55] !log rebalance codfw swift rings T354872 [09:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:58] T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872 [09:21:07] (03Merged) 10jenkins-bot: api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285340 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:21:52] (03PS1) 10CWilliams: icinga/cgi.cfg: Adding CWilliams to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1287364 (https://phabricator.wikimedia.org/T426292) [09:23:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1067.eqiad.wmnet 
with reason: host reimage [09:24:53] (03CR) 10Marostegui: [C:03+1] icinga/cgi.cfg: Adding CWilliams to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1287364 (https://phabricator.wikimedia.org/T426292) (owner: 10CWilliams) [09:25:16] RECOVERY - Memcached on mc1065 is OK: TCP OK - 0.000 second response time on 10.64.177.8 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [09:25:36] (03CR) 10Zabe: "Do we want to try to implement some sort of "slow rollout" for this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [09:25:57] (03CR) 10Zabe: "Do we want to try to implement some sort of "slow rollout" for this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [09:26:00] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1066.eqiad.wmnet with OS bullseye [09:26:40] (03CR) 10CWilliams: [C:03+2] icinga/cgi.cfg: Adding CWilliams to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1287364 (https://phabricator.wikimedia.org/T426292) (owner: 10CWilliams) [09:27:46] (03PS1) 10MVernon: swift: remove 2 drained nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1287365 (https://phabricator.wikimedia.org/T354872) [09:29:58] (03CR) 10Marostegui: [C:03+1] swift: remove 2 drained nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1287365 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:30:16] RECOVERY - Memcached on mc1067 is OK: TCP OK - 0.000 second response time on 10.64.183.11 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [09:30:27] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1068.eqiad.wmnet with OS bullseye [09:33:49] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1065.eqiad.wmnet with OS bullseye [09:39:22] !log jiji@cumin1003 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host mc1067.eqiad.wmnet with OS bullseye [09:41:51] (03PS1) 10JavierMonton: stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) [09:43:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [09:43:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:44:08] !ack [09:44:09] 7929 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [09:44:57] that's the codfw -> eqsin link? [09:46:33] bjensen: yep [09:46:42] on my laptop in 5/10min [09:47:01] but it's scrapping in eqsin [09:49:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2152.codfw.wmnet: Host will be decommissioned [09:51:36] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [09:51:59] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [09:54:03] 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920913 (10A_smart_kitten) FWIW, if the correct image to be displayed for https://commons.wikimedia.org/wiki/File:CitationHelper_-_VE_Editor_Toolbar.png is the one in @aklapper's screenshot, then it seems... 
[09:54:26] (03CR) 10MVernon: [C:03+2] swift: remove 2 drained nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1287365 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:54:52] !log cwilliams@cumin1003 END (ERROR) - Cookbook sre.mysql.depool (exit_code=97) depool db2152.codfw.wmnet: Host will be decommissioned [09:55:32] (03CR) 10Effie Mouzeli: [C:03+2] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:55:44] (03CR) 10CI reject: [V:04-1] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [09:58:49] RESOLVED: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:58:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[09:58:56] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:59:03] (03PS1) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) [09:59:13] (03PS1) 10Phuedx: ext.wikimediaEvents: Add synth-aa-ncs-1 experiment [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) [09:59:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) (owner: 10Phuedx) [09:59:52] (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1000) [10:00:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [10:00:29] (03PS2) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) [10:00:31] 
10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920979 (10jcrespo) Hi, @Jcubic thanks for the report. On upload of a new version, caches are normally purged from our content delivery network, however how much time it takes for that to propagate depends... [10:01:23] 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920981 (10A_smart_kitten) I seem to be getting different responses for from different datacenters. Potentially(... [10:02:07] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2152: Host will be decommissioned [10:02:26] 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920984 (10jcrespo) >>! In T426293#11920981, @A_smart_kitten wrote: > I seem to be getting different responses for !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2152: Host will be decommissioned [10:02:51] FIRING: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ... 
[10:02:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [10:03:09] 10SRE-swift-storage, 06Commons: Commons files doesn't update - https://phabricator.wikimedia.org/T426293#11920988 (10A_smart_kitten) I think we were both typing comments here at the same time :D [10:05:08] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:05:10] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:07:48] (03PS3) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) [10:07:51] RESOLVED: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ... 
[10:07:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [10:10:48] (03PS1) 10CWilliams: instances.yaml: Decommissioning db2152.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287371 (https://phabricator.wikimedia.org/T424344) [10:11:21] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:11:30] !ack [10:11:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [10:11:31] 7930 (ACKED) TransitPeeringTransportOutSaturation network sre (gnmi) [10:14:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [10:14:09] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [10:14:35] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1069.eqiad.wmnet with OS bullseye [10:15:21] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1063.eqiad.wmnet with OS bullseye [10:15:37] (03PS1) 10Cathal Mooney: wmf-netbox: add new bgp group mappings for dse-k8s-wdqs nodes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653) [10:16:28] (03PS2) 10Cathal Mooney: wmf-netbox: add new bgp group mappings for dse-k8s-wdqs nodes [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653) [10:16:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [10:17:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [10:18:30] (03CR) 10Effie Mouzeli: [C:03+2] "Similar to I64475fafdae90bc55ff3e8046dda48b85217594d" [puppet] - 10https://gerrit.wikimedia.org/r/1286775 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:18:34] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:34] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:36] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:40] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] 
PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:46] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:47] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:47] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:48] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:48] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:49] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:19:18] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [10:19:23] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [10:19:24] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:24] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:24] (03CR) 10Effie Mouzeli: [C:03+1] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:19:28] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second 
response time https://wikitech.wikimedia.org/wiki/Swift [10:19:30] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:31] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653) (owner: 10Cathal Mooney) [10:19:33] what was that? [10:19:36] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:36] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:36] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:36] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:36] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:36] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:36] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:37] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:37] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift 
[10:19:38] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:38] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:39] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift [10:20:01] (03PS4) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) [10:20:49] (03PS5) 10Effie Mouzeli: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) [10:21:02] 06SRE, 06Infrastructure-Foundations, 10netops: Create single Homer BGP group template to cover all variants - https://phabricator.wikimedia.org/T349116#11921022 (10cmooney) 05Open→03Declined Closing this one for now. We do need to look a this, but also we need to review in light of having both Junip... [10:21:21] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr4-ulsfo:xe-0/1/2 (Transport: cr2-eqsin:xe-0/1/4 (NTT, ... 
[10:21:21] 369639) {#1076}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=ulsfo+prometheus%2Fops&var-device=cr4-ulsfo:9804&var-interface=xe-0%2F1%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:21:26] the blip seemed real, although not a lot of impact [10:21:34] (03CR) 10Marostegui: [C:03+1] instances.yaml: Decommissioning db2152.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287371 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams) [10:21:35] network would make sense as the culprit [10:21:37] I'm guessing that's related to the saturation [10:21:43] oh [10:21:57] (03CR) 10CWilliams: [C:03+2] instances.yaml: Decommissioning db2152.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287371 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams) [10:22:06] I didn't know that was ongoing [10:25:27] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [10:25:32] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [10:26:18] 06SRE, 10observability: Setup BGP monitoring for PyBal, including amount of prefixes - https://phabricator.wikimedia.org/T79124#11921068 (10cmooney) 05Open→03Resolved a:03cmooney I'm gonna close this one. We now have alerting on this via the bgp stats exported via gnmi (see [[ https://gerrit.wikimed... 
[10:26:33] cortobot: list [10:26:46] oh whoops :D [10:27:09] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [10:27:26] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [10:28:57] (03PS2) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) [10:29:46] (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [10:31:44] (03CR) 10Effie Mouzeli: [C:03+2] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:33:20] (03PS1) 10Btullis: Configure rsyslog to forward 'dumps-http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) [10:33:29] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [10:33:44] (03Merged) 10jenkins-bot: redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285339 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:33:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#11921091 (10cmooney) 05Open→03Resolved a:03cmooney I'm going to close this one. I think everyone is agreed cross-rack links to have an LVS peer... 
[10:34:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [10:34:17] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [10:34:31] !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [10:38:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [10:40:53] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [10:41:33] !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [10:42:05] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:42:15] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:42:42] looking [10:43:07] cezmunsta: ^ [10:43:19] it's db2152 being removed [10:43:24] cezmunsta: once the puppet change is merged, you need to execute the dbctl command to remove it from dbctl in cumin [10:43:52] cezmunsta: as mentioned here: https://wikitech.wikimedia.org/wiki/MariaDB/Decommissioning_a_DB_Host#Remove_the_host_from_dbctl [10:44:19] cezmunsta: you can just run: sudo dbctl config commit -m "Remove HOSTNAME from dbctl TASKNUMBER" from cumin1003 for instance [10:44:21] * cezmunsta Yep, but currently not resolving that hose :) [10:44:25] *host [10:44:41] cezmunsta: do you want me to run it for you so it doesn't get hanging there for long? 
[10:44:42] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11921120 (10cmooney) 05Open→03Resolved Ok this is rolled out and working. I have tried to update our dashboards wher... [10:44:58] * cezmunsta : yes please [10:45:05] cezmunsta: doing it! [10:45:09] ty [10:45:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2152 from dbctl T424344', diff saved to https://phabricator.wikimedia.org/P92523 and previous config saved to /var/cache/conftool/dbconfig/20260514-104521-marostegui.json [10:45:25] T424344: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344 [10:45:27] done! [10:47:05] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:47:15] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:49:19] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11921125 (10cmooney) {F81386939 width=600} [10:49:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1069.eqiad.wmnet with OS bullseye [10:50:21] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11921128 (10MatthewVernon) [10:53:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1063.eqiad.wmnet with OS bullseye [10:53:55] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [10:53:58] !log jiji@deploy1003 helmfile 
[aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [10:56:02] (03PS1) 10Federico Ceratto: cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) [10:56:02] (03CR) 10Federico Ceratto: "(as discussed on IRC with elukey)" [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [10:56:06] (03CR) 10Cathal Mooney: [C:03+2] wmf-netbox: add new bgp group mappings for dse-k8s-wdqs nodes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287372 (https://phabricator.wikimedia.org/T425653) (owner: 10Cathal Mooney) [10:56:30] (03CR) 10CI reject: [V:04-1] cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [10:57:14] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [10:58:59] (03PS3) 10Ayounsi: Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) [11:00:11] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:00:26] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:00:53] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: sync [11:01:04] !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: sync [11:01:17] (03CR) 10CI reject: [V:04-1] Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [11:02:04] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet 
with OS bookworm [11:05:52] (03PS2) 10Federico Ceratto: cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) [11:08:56] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:08:59] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:13:20] (03PS3) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) [11:14:14] (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [11:16:50] (03PS4) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) [11:17:41] (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [11:19:19] (03PS5) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) [11:19:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:19:41] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:19:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: 
Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:19:47] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:20:10] (03CR) 10CI reject: [V:04-1] throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [11:20:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:20:31] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [11:20:37] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [11:21:02] (03CR) 10Cathal Mooney: "recheck" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [11:22:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:23:06] (03PS6) 10Robertsky: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) [11:25:08] (03Abandoned) 10Sergio Gimeno: loggedOutWarning: set lastEditor used earlier [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1285743 (https://phabricator.wikimedia.org/T425604) (owner: 10Sergio Gimeno) [11:26:02] (03CR) 10Chlod Alejandro: [C:03+1] throttle rule for ESEAP 
Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [11:26:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:26:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:31:12] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [11:31:22] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [11:48:13] (03CR) 10Elukey: "The request is legit, I'll go through my team just to be sure and come back!" [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [11:54:48] (03PS1) 10Marostegui: instances.yaml: Add pc2023 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1287384 (https://phabricator.wikimedia.org/T418973) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1200) [12:08:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:28] (03CR) 10Cathal Mooney: [C:03+2] Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [12:10:33] (03PS1) 10Atsuko: services_proxy: isetting up toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) [12:14:39] (03PS3) 10Federico Ceratto: cumin: Install pydantic and httpx packages [puppet] - 10https://gerrit.wikimedia.org/r/1287378 
(https://phabricator.wikimedia.org/T409926) [12:15:30] (03CR) 10Federico Ceratto: cumin: Install pydantic and httpx packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [12:15:36] (03PS2) 10Effie Mouzeli: rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285344 (https://phabricator.wikimedia.org/T419976) [12:16:53] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add pc2023 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1287384 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [12:17:50] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1287391 (owner: 10L10n-bot) [12:18:16] (03CR) 10Effie Mouzeli: [C:03+2] rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285344 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [12:18:21] another blip [12:18:28] on codfw upload [12:18:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2023 to pc3 T418973', diff saved to https://phabricator.wikimedia.org/P92524 and previous config saved to /var/cache/conftool/dbconfig/20260514-121839-marostegui.json [12:18:41] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:41] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:43] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [12:18:49] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:49] PROBLEM - Swift https frontend on ms-fe2017 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:49] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:49] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:49] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:49] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:50] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:50] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:51] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:51] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:52] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:19:15] larger this time [12:19:31] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:31] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.172 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:40] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:40] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:41] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:41] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2023 to pc3 codfw master T418973', diff saved to https://phabricator.wikimedia.org/P92525 and previous config saved to 
/var/cache/conftool/dbconfig/20260514-121958-marostegui.json [12:20:24] (03Merged) 10jenkins-bot: rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285344 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [12:20:53] (03PS1) 10Marostegui: pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287395 (https://phabricator.wikimedia.org/T418973) [12:21:15] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:21:32] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:21:35] (03CR) 10Marostegui: [C:03+2] pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287395 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [12:22:17] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426221#11921398 (10Jclark-ctr) Rebalanced pdu still monitoring continuing to monitor [12:22:36] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1287396 (owner: 10L10n-bot) [12:24:43] (03PS1) 10Marostegui: pc2023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287397 (https://phabricator.wikimedia.org/T418973) [12:26:01] (03CR) 10Marostegui: [C:03+2] pc2023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287397 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [12:27:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc3 with pc2023 as codfw master T418973', diff saved to https://phabricator.wikimedia.org/P92526 and previous config saved to /var/cache/conftool/dbconfig/20260514-122707-marostegui.json [12:27:12] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [12:27:36] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 28458 [12:28:48] (03PS1) 10Btullis: Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) [12:31:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 28458 [12:33:13] (03PS1) 10Marostegui: installserver: Do not format pc2023 [puppet] - 10https://gerrit.wikimedia.org/r/1287408 (https://phabricator.wikimedia.org/T418973) [12:37:15] (03CR) 10Nikerabbit: [C:04-1] "No code change?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (owner: 10Abijeet Patro) [12:39:02] (03PS1) 10KartikMistry: Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) [12:40:31] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: update bgp groups for dse-k8s-wdqs - cmooney@cumin1003 [12:42:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: update bgp groups for dse-k8s-wdqs - cmooney@cumin1003 [12:42:36] (03CR) 10Sbisson: [C:03+1] Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) (owner: 10KartikMistry) [12:42:56] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [12:43:06] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) (owner: 10KartikMistry) [12:43:14] Deploying cxserver. 
[12:45:11] (03Merged) 10jenkins-bot: Update cxserver to 2026-05-14-123010-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287409 (https://phabricator.wikimedia.org/T426174) (owner: 10KartikMistry) [12:45:51] (03CR) 10Cathal Mooney: [C:03+2] GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [12:46:40] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:47:03] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:47:23] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1279] - vriley@cumin1003" [12:47:28] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1279] - vriley@cumin1003" [12:47:28] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:47:55] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1279 [12:49:28] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [12:49:38] (03PS1) 10Btullis: mediawiki-dumps-legacy: Allow launching dumps from airflow-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) [12:49:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T426291 [12:49:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1279 [12:49:46] T426291: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T426291 [12:50:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2161 with weight 0 T426291', diff saved to 
https://phabricator.wikimedia.org/P92527 and previous config saved to /var/cache/conftool/dbconfig/20260514-125014-fceratto.json [12:50:34] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1279.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:53:15] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1280] - vriley@cumin1003" [12:53:21] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1280] - vriley@cumin1003" [12:53:21] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:39] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1280 [12:54:42] (03CR) 10Lerickson: [C:03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [12:54:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1280 [12:54:55] (03PS2) 10Abijeet Patro: ULS rewrite: Enable on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288) [12:55:17] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [12:55:48] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:56:14] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1280.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:56:23] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:56:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:40] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:58:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921466 (10VRiley-WMF) 05Open→03In progress [12:58:09] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1287361 (https://phabricator.wikimedia.org/T426291) (owner: 10Gerrit maintenance bot) [12:58:15] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:59:05] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1281] - vriley@cumin1003" [12:59:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1281] - vriley@cumin1003" [12:59:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:21] o/ [12:59:32] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1281 [13:00:05] Urbanecm and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1300). [13:00:05] annet, Nvdtn19, Krinkle, stephanebisson, mfossati, phuedx, and robertsky: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:17] o/ [13:00:20] o/ [13:00:21] (03Merged) 10jenkins-bot: GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [13:00:32] o/ [13:00:38] !log Updated cxserver to 2026-05-14-123010-production (T426174, T404298) [13:00:41] !log Starting s8 codfw failover from db2165 to db2161 - T426291 [13:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:43] T426174: cxserver unit tests leak mediawiki api requests - https://phabricator.wikimedia.org/T426174 [13:00:43] T404298: Can't translate en:Tokyo in Gujarati - https://phabricator.wikimedia.org/T404298 [13:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:46] T426291: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T426291 [13:01:33] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1281 [13:01:44] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:01:49] annet: stephanebisson: Do either of you want to self service, if a deployer isn't available? 
[13:02:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2161 to s8 primary T426291', diff saved to https://phabricator.wikimedia.org/P92528 and previous config saved to /var/cache/conftool/dbconfig/20260514-130213-fceratto.json [13:02:17] Krinkle: I haven't done that before so would prefer not to [13:02:22] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:03:26] FIRING: [44x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:46] I can deploy my patch [13:03:53] When it's my turn [13:04:49] stephanebisson: OK, I'd say go ahead. I can do annet and mine after that. [13:05:06] thanks! [13:05:19] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:05:57] on it [13:06:14] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1282] - vriley@cumin1003" [13:06:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1282] - vriley@cumin1003" [13:06:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:07:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson) [13:07:19] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1282 [13:07:32] PROBLEM - orchestrator resolve 
cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [13:07:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set correct weight T426291', diff saved to https://phabricator.wikimedia.org/P92529 and previous config saved to /var/cache/conftool/dbconfig/20260514-130743-fceratto.json [13:07:47] T426291: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T426291 [13:08:00] (03Merged) 10jenkins-bot: Enable the Article Guidance experiment on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287043 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson) [13:08:18] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2165: Repooling after switchover [13:08:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1279.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:08:42] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287043|Enable the Article Guidance experiment on simplewiki (T426278)]] [13:08:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1282 [13:08:45] T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278 [13:09:02] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:09:14] (03CR) 10Ottomata: "Hm, what topic does this send to? I suppose a generic rsylog topic? Or is it a separate nginxdumps only topic?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [13:09:52] (03PS1) 10CWilliams: mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 [13:09:53] (03CR) 10Ottomata: "Oo, and does this send message to a topic combined with other rsyslog ECS formatted messages? If so, we have some thinking to do!" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [13:10:13] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:10:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2165: Repooling after switchover [13:10:28] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:10:32] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1287043|Enable the Article Guidance experiment on simplewiki (T426278)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:10:42] (03PS2) 10CWilliams: mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 (https://phabricator.wikimedia.org/T424344) [13:11:30] o/ robertsky doesn't seem to be available so i can supervise his patch instead [13:12:04] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1282.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:12:08] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:12:41] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:13:17] hihi [13:13:19] o/ [13:13:20] oh there he is [13:13:24] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [13:13:59] (03CR) 10Ottomata: Configure nginx to log requests in ECS format to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [13:14:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1280.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:14:41] hi Krinkle, can you help deploy my patch? apologies, was caught in the traffic back to the hotel. 
[13:14:50] (03CR) 10Marostegui: [C:03+1] mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams) [13:15:18] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1283] - vriley@cumin1003" [13:15:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1283] - vriley@cumin1003" [13:15:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:55] (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 [13:16:15] robertsky: np, once it's our turn I will look at yours. there are a few other patches before yours. [13:16:32] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [13:16:52] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287043|Enable the Article Guidance experiment on simplewiki (T426278)]] (duration: 08m 10s) [13:16:55] T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278 [13:16:57] thanks! [13:17:32] annet: would you like to deploy the backport and config simultaneously or one after the other? 
[13:17:41] My deployment is done [13:17:56] (03CR) 10Marostegui: [C:03+2] installserver: Do not format pc2023 [puppet] - 10https://gerrit.wikimedia.org/r/1287408 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [13:17:57] Krinkle: simultaneous would be great [13:18:00] Okay [13:18:05] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1283 [13:18:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich) [13:18:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286327 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich) [13:18:29] (03PS1) 10Tiziano Fogli: slothslos/deploy: wrap cleanup command in Bash to allow brace expansion [puppet] - 10https://gerrit.wikimedia.org/r/1287416 (https://phabricator.wikimedia.org/T414579) [13:18:39] wmf.2 is everywhere according to https://versions.toolforge.org/ [13:18:40] (03CR) 10Atsuko: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [13:18:41] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:45] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.97 ms [13:18:49] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:49] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:53] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:54] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:54] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:55] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:18:56] stand by for testing... 
[13:19:07] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:19:14] (03Merged) 10jenkins-bot: Add ReadingLists Account Creation CTA campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich) [13:19:28] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1283 [13:19:33] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:39] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:39] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 
0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:43] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:44] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:44] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [13:19:45] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [13:20:37] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1283.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:21:01] (03CR) 10Tiziano Fogli: [C:03+2] slothslos/deploy: wrap cleanup command in Bash to allow brace expansion [puppet] - 10https://gerrit.wikimedia.org/r/1287416 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [13:21:39] https://integration.wikimedia.org/zuul/#q=wmf [13:21:49] (03PS12) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 [13:21:49] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:45] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.97 ms [13:22:46] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:23:38] (03PS13) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 [13:24:57] (03CR) 10A-pizzata: changes to accelerate sqoop landing for 
mediawiki_history_incremental_v1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (owner: 10A-pizzata) [13:25:11] (03PS14) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 [13:25:29] (03PS15) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [13:25:38] (03PS3) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) [13:29:10] (03Merged) 10jenkins-bot: WelcomeSurvey: Respect returnTo for campaigns skipping the survey [extensions/GrowthExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286327 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich) [13:29:26] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1285913|Add ReadingLists Account Creation CTA campaign (T422169)]], [[gerrit:1286327|WelcomeSurvey: Respect returnTo for campaigns skipping the survey (T422169)]] [13:29:29] T422169: Account Creation CTA experiment: handle experience after account creation - https://phabricator.wikimedia.org/T422169 [13:29:48] (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney) [13:29:50] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1279.eqiad.wmnet with OS bookworm [13:29:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1279.eqiad.wmnet with OS bookworm [13:30:34] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
db1282.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:31:13] !log krinkle@deploy1003 krinkle, annet: Backport for [[gerrit:1285913|Add ReadingLists Account Creation CTA campaign (T422169)]], [[gerrit:1286327|WelcomeSurvey: Respect returnTo for campaigns skipping the survey (T422169)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:31:23] testing... [13:31:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2150: Host will be decommissioned [13:32:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2150: Host will be decommissioned [13:32:25] Krinkle: LGTM, thanks again [13:33:06] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2151: Host will be decommissioned [13:33:24] (03CR) 10Cathal Mooney: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1225579 (owner: 10Ayounsi) [13:33:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2151: Host will be decommissioned [13:34:45] (03CR) 10Atsuko: [C:03+2] services_proxy: isetting up toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287388 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [13:36:04] (03PS1) 10Bking: dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) [13:36:27] Krinkle: I can self-deploy [13:37:45] annet: ok, proceeding [13:37:47] !log krinkle@deploy1003 krinkle, annet: Continuing with deployment [13:38:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1283.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:38:52] mfossati: ack, there's one more at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1300 (my own). 
Then I can either do the one for robertsky, or we can swap to keep the order. [13:39:26] skipping Nvdtn19 and phuedx who haven't ack'ed yet to my knowledge. [13:40:02] (03CR) 10Krinkle: [C:03+2] Enable wgTrackMediaRequestProvenance on remaining Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [13:40:17] (03PS1) 10CWilliams: instances.yaml: Decommissioning db2151.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287420 (https://phabricator.wikimedia.org/T424343) [13:40:28] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:40:57] (03Merged) 10jenkins-bot: Enable wgTrackMediaRequestProvenance on remaining Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [13:41:11] (03CR) 10Atsuko: [C:03+1] dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [13:41:50] Krinkle: all right, can we please keep the order? [13:41:59] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:41:59] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285913|Add ReadingLists Account Creation CTA campaign (T422169)]], [[gerrit:1286327|WelcomeSurvey: Respect returnTo for campaigns skipping the survey (T422169)]] (duration: 12m 33s) [13:42:03] T422169: Account Creation CTA experiment: handle experience after account creation - https://phabricator.wikimedia.org/T422169 [13:42:34] mfossati: sure, np. 
[13:42:53] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] [13:42:55] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1280.eqiad.wmnet with OS bookworm [13:42:57] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [13:43:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921648 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1280.eqiad.wmnet with OS bookworm [13:43:45] (03PS1) 10CWilliams: instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342) [13:44:07] (03CR) 10Bking: [C:03+2] dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [13:44:15] (03PS1) 10Bearloga: EventStreamConfig: fix product_metrics.web_base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) [13:44:41] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:45:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:45:23] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage [13:45:49] !log krinkle@deploy1003 krinkle: Continuing with deployment [13:46:03] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.98 ms [13:46:50] (03PS1) 10Hnowlan: corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) [13:47:36] (03PS1) 10Elukey: sre.hosts.provision: add sretest2010 to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1287425 [13:47:55] (03CR) 10Marostegui: [C:03+1] instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342) (owner: 10CWilliams) [13:48:18] (03CR) 10Marostegui: [C:03+1] instances.yaml: Decommissioning db2151.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287420 (https://phabricator.wikimedia.org/T424343) (owner: 10CWilliams) [13:48:31] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:49:14] Krinkle: My bad. I lost track of time. I'm here if there's still space [13:49:19] I recall our deployment calendar previously stating a maximum number of patches per window. It seems this is no longer there. However, per T225730 CI for anything other than a pure config patch is upto 20min these days. Anyway, it is what it is. [13:49:20] T225730: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 [13:49:52] phuedx: ack, in a minute, mfossati is up. After that I'm doing robertsky and can try to do yours as well, but i'll be after the hour is up. 
[13:49:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage [13:49:57] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] (duration: 07m 03s) [13:50:00] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [13:50:06] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:19] mfossati: you're up :) [13:50:23] Krinkle: https://wikitech.wikimedia.org/wiki/Backport_windows#Guidelines has the line "Our windows have a soft limit of 6 patches", that might be what you're remembering maybe? [13:50:45] I see. It used to be on the calendar itself e.g. in the heading or caption at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1300 [13:50:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) (owner: 10Marco Fossati) [13:50:48] no, that used to be right on the calendar [13:50:59] ah, ack [13:51:04] A_smart_kitten: anyway, thanks for finding that. Good to know [13:51:26] it was removed in a task I can't immediately find after someone (me iirc) pointed out that no-one followed that, on the argument that a patch limit is not the good thing to measure because a single sync can do multiple patches [13:51:53] fair enough. 
and if it's all config patches, one could do them in 5-10min each [13:51:53] (03PS2) 10Tiziano Fogli: thanos/compact: avoid constant Puppet changes [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) [13:52:06] realistically 6 people is perhaps a better limit than number of patches [13:52:10] (03Merged) 10jenkins-bot: dse-k8s: roll back unnecessary TLS changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287419 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [13:52:11] https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/43 [13:52:30] we have 7 this time, and given multiple involve a MW patch, that'll take 1.5h in total [13:52:42] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2150.codfw.wmnet with reason: Depooled host, will be decommissioned [13:53:04] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Depooled host, will be decommissioned [13:53:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [13:53:10] (7ppl, 8 patches) [13:53:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2165 (T419635)', diff saved to https://phabricator.wikimedia.org/P92533 and previous config saved to /var/cache/conftool/dbconfig/20260514-135315-fceratto.json [13:53:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:53:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:53:28] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2152.codfw.wmnet with reason: Depooled host, will be decommissioned [13:53:28] (03Merged) 
10jenkins-bot: Scale share-highlight card to fit small viewports [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287363 (https://phabricator.wikimedia.org/T426247) (owner: 10Marco Fossati) [13:53:44] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:53:46] !log mfossati@deploy1003 Started scap sync-world: Backport for [[gerrit:1287363|Scale share-highlight card to fit small viewports (T426247)]] [13:53:50] T426247: Share Highlight: Ensure dialog header is visible on small devices - https://phabricator.wikimedia.org/T426247 [13:54:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921702 (10Jhancock.wm) [13:54:16] now that the addition is tool-assisted, one could compute an estimate, e.g. per person, assign a 5-min or 20min estimate (if it includes a MW patch), and once it is >= 60min, don't advertise the window anymore as available. [13:54:25] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:54:37] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:54:50] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 32.95 ms [13:55:10] Krinkle: +1 to that [13:55:14] ACKNOWLEDGEMENT - MariaDB Replica Lag: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 560.13 seconds Jcrespo expected - The acknowledgement expires at: 2026-05-15 18:54:58. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:55:34] !log mfossati@deploy1003 mfossati: Backport for [[gerrit:1287363|Scale share-highlight card to fit small viewports (T426247)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:55:41] Krinkle: And a 45 min if an MW patch that touches i18n? 
[13:56:01] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:56:03] testing, please hold on [13:56:04] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [13:56:19] 5 mins for a config patch seems a bit tight [13:56:22] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:56:23] but otherwise +1 to the idea [13:56:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T419635)', diff saved to https://phabricator.wikimedia.org/P92534 and previous config saved to /var/cache/conftool/dbconfig/20260514-135626-fceratto.json [13:56:42] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:56:44] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: add sretest2010 to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1287425 (owner: 10Elukey) [13:56:46] Used to be 45 seconds for config patches. [13:56:49] !log mfossati@deploy1003 mfossati: Continuing with deployment [13:56:50] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:56:54] * James_F shakes his cane at the passing of the times. [13:57:04] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:57:29] taavi: ack. not saying it'll be enforced at runtime, just about whether or not to prevent scheduling. I'd rather the tool allow too many than too few and become circumvented/ignored. [13:58:04] James_F: hehe, hopefully not for much longer once we get the new l10n format live. [13:58:14] Krinkle: We'll see. 
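The scheduling estimate proposed above (5 min per deployer, 20 min if their backport includes a MW patch, stop advertising the window once the total reaches 60 min) can be sketched roughly as follows. This is only an illustration of the idea from the discussion, not the scheduling tool's actual code: the names `Deployer`, `estimate_minutes` and `window_is_advertised` are hypothetical, and the 45 min i18n case and the "5 mins seems a bit tight" objection are left out because neither was settled.

```python
from dataclasses import dataclass

# Figures taken from the discussion; all of them are proposals, not policy.
CONFIG_ONLY_MIN = 5      # per-person estimate for config-only backports
MW_PATCH_MIN = 20        # per-person estimate when a MW patch is included
WINDOW_BUDGET_MIN = 60   # a backport window lasts one hour


@dataclass
class Deployer:
    """One person signed up for a window (hypothetical structure)."""
    name: str
    includes_mw_patch: bool


def estimate_minutes(deployers):
    """Sum the per-person estimates for everyone already in the window."""
    return sum(
        MW_PATCH_MIN if d.includes_mw_patch else CONFIG_ONLY_MIN
        for d in deployers
    )


def window_is_advertised(deployers, budget=WINDOW_BUDGET_MIN):
    """Advertise the window only while the estimate is below the budget.

    This only affects whether the window is offered for scheduling;
    nothing is enforced at deploy time.
    """
    return estimate_minutes(deployers) < budget
```

Under these figures, three deployers with MW patches already fill the hour, whereas twelve config-only deployers would be needed to reach it, so the result errs on the side of allowing too many rather than too few.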
[13:58:23] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1280.eqiad.wmnet with reason: host reimage [13:58:35] (03CR) 10CWilliams: [C:03+2] instances.yaml: Decommissioning db2151.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287420 (https://phabricator.wikimedia.org/T424343) (owner: 10CWilliams) [13:59:24] really sorry for the last minute patch. we got the ip addresses only today. >.< and the conference is tomorrow. [13:59:46] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1282.eqiad.wmnet with OS bookworm [13:59:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921768 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1282.eqiad.wmnet with OS bookworm [14:00:04] (03PS1) 10Btullis: Add support for creating arbitrary PVCs to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) [14:00:25] (03PS1) 10Sbisson: Simplewiki: include article wizard in AG experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) [14:00:56] !log mfossati@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287363|Scale share-highlight card to fit small viewports (T426247)]] (duration: 07m 09s) [14:00:59] T426247: Share Highlight: Ensure dialog header is visible on small devices - https://phabricator.wikimedia.org/T426247 [14:01:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson) [14:01:05] done! 
[14:01:11] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Remove db2151 from dbctl T424343', diff saved to https://phabricator.wikimedia.org/P92535 and previous config saved to /var/cache/conftool/dbconfig/20260514-140110-cwilliams.json [14:01:14] T424343: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343 [14:01:16] (03CR) 10Tiziano Fogli: thanos/compact: avoid constant Puppet changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [14:02:30] (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 [14:03:24] robertsky: reviewing yours now [14:03:38] elukey@cumin1003 reimage (PID 3911913) is awaiting input [14:03:42] 06SRE, 10corto, 10Incident Tooling, 13Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11921785 (10A_smart_kitten) (cross-referencing to {T389664}, where the default visibility of incident tasks was previously discussed FWICS) [14:04:05] (03PS1) 10Atsuko: services_proxy: enabling toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) [14:04:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1280.eqiad.wmnet with reason: host reimage [14:04:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) (owner: 10Robertsky) [14:05:15] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [14:05:48] (03Merged) 10jenkins-bot: throttle rule for ESEAP Conference 2026 15-18 May 2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287367 (https://phabricator.wikimedia.org/T426295) 
(owner: 10Robertsky) [14:06:06] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1287367|throttle rule for ESEAP Conference 2026 15-18 May 2026 (T426295)]] [14:06:10] T426295: Request throttle exemption of IP addresses for ESEAP Conference 2026 - https://phabricator.wikimedia.org/T426295 [14:06:24] Krinkle, will need to run maintenance script: https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold#Reset [14:06:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92536 and previous config saved to /var/cache/conftool/dbconfig/20260514-140635-fceratto.json [14:06:57] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:07:06] to clear the cache. [14:07:15] (03CR) 10Btullis: "Good question. I checked the output from this command:" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:07:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:07:33] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1279.eqiad.wmnet with OS bookworm [14:07:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1279.eqiad.wmnet with OS bookworm completed: - db1279 (**PASS**) -... [14:07:58] !log krinkle@deploy1003 krinkle, robertsky: Backport for [[gerrit:1287367|throttle rule for ESEAP Conference 2026 15-18 May 2026 (T426295)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [14:08:22] nothing to test... [14:08:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921799 (10Jhancock.wm) [14:08:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921800 (10Jhancock.wm) [14:08:42] (03CR) 10Novem Linguae: "Does CortoBot need to be added to WMF-NDA before this patch is merged? This might avoid a similar issue to what happened in T389664#106960" [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [14:09:06] test servers accessible [14:09:40] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [14:09:43] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [14:09:57] !log krinkle@deploy1003 krinkle, robertsky: Continuing with deployment [14:10:02] ack [14:10:53] phuedx: are you okay self-servicing? 
[14:11:54] (03PS2) 10CWilliams: instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342) [14:12:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11921818 (10Jhancock.wm) [14:12:35] (03CR) 10Btullis: "Yes, confirmed:" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:13:04] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [14:14:07] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287367|throttle rule for ESEAP Conference 2026 15-18 May 2026 (T426295)]] (duration: 08m 00s) [14:14:10] T426295: Request throttle exemption of IP addresses for ESEAP Conference 2026 - https://phabricator.wikimedia.org/T426295 [14:14:56] Hi, I have a no-op config change to deploy, let me know when the window is clear! Krinkle it looks like you are waiting for phuedx? 
[14:15:06] Krinkle: Can do [14:15:25] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1282.eqiad.wmnet with reason: host reimage [14:16:35] (03CR) 10CWilliams: [C:03+2] instances.yaml: Decommissioning db2150.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1287421 (https://phabricator.wikimedia.org/T424342) (owner: 10CWilliams) [14:16:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92537 and previous config saved to /var/cache/conftool/dbconfig/20260514-141644-fceratto.json [14:17:04] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1284] - vriley@cumin1003" [14:17:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1284] - vriley@cumin1003" [14:17:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:25] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1284 [14:18:13] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Remove db2150 from dbctl T424342', diff saved to https://phabricator.wikimedia.org/P92538 and previous config saved to /var/cache/conftool/dbconfig/20260514-141812-cwilliams.json [14:18:17] T424342: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342 [14:18:18] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1283.eqiad.wmnet with OS bookworm [14:18:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1283.eqiad.wmnet with OS bookworm [14:18:38] (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for 
release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney) [14:18:51] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:53] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1282.eqiad.wmnet with reason: host reimage [14:18:54] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:54] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:55] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:55] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:18:56] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:19:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) (owner: 10Phuedx) [14:19:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [14:19:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2165 (T419961)', diff saved to https://phabricator.wikimedia.org/P92539 and previous config saved to /var/cache/conftool/dbconfig/20260514-141922-fceratto.json [14:19:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1284 [14:19:41] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:43] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:43] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:43] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:43] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:43] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.184 second response time 
https://wikitech.wikimedia.org/wiki/Swift [14:19:43] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:44] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:44] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:45] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:45] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:46] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift [14:19:47] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [14:20:12] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1284.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:20:15] (03CR) 10Hnowlan: "corto is already a member of WMF-NDA: https://phabricator.wikimedia.org/project/members/61/" [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [14:21:33] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:22:21] (03CR) 10Novem Linguae: "Looks like it's the 6th project and I only looked at the first 5. Naturally :) Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [14:22:28] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Add synth-aa-ncs-1 experiment [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287368 (https://phabricator.wikimedia.org/T419514) (owner: 10Phuedx) [14:22:32] Krinkle, can you run the following commands to clear the cache for the throttle? iirc, it takes 3 days for the cache to expire, if any? https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold#Reset [14:22:44] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1287368|ext.wikimediaEvents: Add synth-aa-ncs-1 experiment (T419514)]] [14:22:48] T419514: Run a synthetic A/A non-cache-splitting experiment - https://phabricator.wikimedia.org/T419514 [14:23:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:23:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1280.eqiad.wmnet with OS bookworm [14:23:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1280.eqiad.wmnet with OS bookworm completed: - db1280 (**PASS**) -... 
[14:24:23] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [14:24:31] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1287368|ext.wikimediaEvents: Add synth-aa-ncs-1 experiment (T419514)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:24:45] robertsky: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/b474e435e552b95feffaacf79d8553f0d1766545/wmf-config/InitialiseSettings.php#2406 I don't see any limit there longer than 24h [14:25:13] in any event, afaik the limit itself is not part of the cache, only the current count [14:25:24] ok [14:25:26] clearing the cache is about resetting it if you need more than the currently configured limit without increasing it [14:25:36] i.e. instead of a patch like the one you made. [14:25:49] ok [14:25:59] thanks! [14:26:10] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [14:26:21] yw [14:26:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T419961)', diff saved to https://phabricator.wikimedia.org/P92540 and previous config saved to /var/cache/conftool/dbconfig/20260514-142650-fceratto.json [14:27:06] (03PS4) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) [14:27:29] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [14:29:30] JavaScript console looks clear on regular navigation. No ResourceLoader errors etc. 
Continuing [14:29:46] !log phuedx@deploy1003 phuedx: Continuing with deployment [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1430) [14:31:46] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1285] - vriley@cumin1003" [14:31:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1285] - vriley@cumin1003" [14:31:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:32:16] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1285 [14:33:05] (03CR) 10Dragoniez: "@fd7ezs8cx@mozmail.com When the backport window opens you must be available in IRC's #wikimedia-operations channel (see https://wikitech.w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [14:33:47] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1285 [14:33:48] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1283.eqiad.wmnet with reason: host reimage [14:33:56] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [14:33:58] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287368|ext.wikimediaEvents: Add synth-aa-ncs-1 experiment (T419514)]] (duration: 11m 14s) [14:34:02] T419514: Run a synthetic A/A non-cache-splitting experiment - https://phabricator.wikimedia.org/T419514 [14:35:20] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:35:23] (03CR) 10Federico Ceratto: "Yes the test ran fine." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [14:35:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:35:38] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1282.eqiad.wmnet with OS bookworm [14:35:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11921951 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1282.eqiad.wmnet with OS bookworm completed: - db1282 (**PASS**) -... [14:36:33] ottomata: Done! [14:36:42] /cc Krinkle [14:37:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92541 and previous config saved to /var/cache/conftool/dbconfig/20260514-143659-fceratto.json [14:37:02] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1289 [14:37:03] !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1289 [14:37:38] (03CR) 10Kevin Bazira: ml-services: add qwen36-27b to experimental ns (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos) [14:37:41] vriley@cumin1003 provision (PID 3939945) is awaiting input [14:38:40] (03PS5) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) [14:38:42] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1287] - vriley@cumin1003" [14:38:48] 
!log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1287] - vriley@cumin1003" [14:38:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1284.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:39:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1283.eqiad.wmnet with reason: host reimage [14:39:44] (03PS6) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) [14:40:23] (03CR) 10Ilias Sarantopoulos: ml-services: add qwen36-27b to experimental ns (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos) [14:41:27] (03CR) 10Dragoniez: "See also https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_for_backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [14:42:46] (03CR) 10Santiago Faci: [C:03+1] EventStreamConfig: fix product_metrics.web_base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) (owner: 10Bearloga) [14:42:53] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: FY2526 Q3 ulsfo: switch refresh - https://phabricator.wikimedia.org/T408510#11921972 (10RobH) [14:44:15] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1285.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:45:48] (03PS3) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 
10https://gerrit.wikimedia.org/r/1287415 [14:46:51] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:47:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P92542 and previous config saved to /var/cache/conftool/dbconfig/20260514-144707-fceratto.json [14:47:18] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:47:20] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:48:14] (03PS1) 10Codename Noreste: Restrict the changetags user right to bots and sysops on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) [14:49:45] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [14:49:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:51:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: 10Codename Noreste) [14:52:23] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1284.eqiad.wmnet with OS bookworm [14:52:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922005 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1284.eqiad.wmnet with OS bookworm [14:53:44] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1288] - vriley@cumin1003" [14:53:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1288] - vriley@cumin1003" [14:53:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:53] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:54:06] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1288 [14:54:08] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [14:54:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:54:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1283.eqiad.wmnet with OS bookworm [14:54:36] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1283.eqiad.wmnet with OS bookworm completed: - db1283 (**PASS**) -... 
[14:54:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:55:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1285.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:18] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1288 [14:56:19] vriley@cumin1003 provision (PID 3948521) is awaiting input [14:57:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T419961)', diff saved to https://phabricator.wikimedia.org/P92544 and previous config saved to /var/cache/conftool/dbconfig/20260514-145715-fceratto.json [14:57:19] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1288.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:58:49] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:59:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [14:59:50] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [14:59:51] (03CR) 10Bking: [C:03+1] services_proxy: enabling toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:00:04] andre and brennen: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1500). [15:00:44] jouncebot: well, European Train log triage was six hours ago according to the Google calendar [15:01:43] (03CR) 10Bking: [C:03+1] "Feel free to add `opensearch-ttmserver` and `opensearch-toolhub` as well, not just the `-test` versions." [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:04:02] (03CR) 10Thcipriani: "I think this is the problem you're seeing in devtools deploying phab to an upgraded host. Without a puppetserver there, it's hard to verif" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani) [15:04:24] (03CR) 10Ottomata: "Hm, this could be an issue. Will every message in the topic match the ECS schema? If not, we'll have to filter the topic for the specific" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [15:05:40] (03CR) 10Cathal Mooney: [C:03+2] CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney) [15:07:28] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1285.eqiad.wmnet with OS bookworm [15:07:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922056 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1285.eqiad.wmnet with OS bookworm [15:08:05] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:08:13] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1284.eqiad.wmnet with reason: host reimage [15:08:23] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by bearloga@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) (owner: 10Bearloga) [15:08:43] (03CR) 10Kevin Bazira: [C:03+1] ml-services: add qwen36-27b to experimental ns (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos) [15:09:00] (03CR) 10BryanDavis: "Cause of https://phabricator.wikimedia.org/T425687 "No Puppet resources found on instance deployment-mx04 on project deployment-prep"" [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) (owner: 10Muehlenhoff) [15:09:51] RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:09:57] (03Merged) 10jenkins-bot: EventStreamConfig: fix product_metrics.web_base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287422 (https://phabricator.wikimedia.org/T426209) (owner: 10Bearloga) [15:10:14] !log bearloga@deploy1003 Started scap sync-world: Backport for [[gerrit:1287422|EventStreamConfig: fix product_metrics.web_base (T426209)]] [15:10:17] T426209: Explicitly declare absence of contextual attributes in product_metrics.web_base stream - https://phabricator.wikimedia.org/T426209 [15:12:01] !log bearloga@deploy1003 bearloga: Backport for [[gerrit:1287422|EventStreamConfig: fix product_metrics.web_base (T426209)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:12:07] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:12:21] !log bearloga@deploy1003 bearloga: Continuing with deployment [15:14:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1284.eqiad.wmnet with reason: host reimage [15:14:58] (03CR) 10Atsuko: [C:03+2] services_proxy: enabling toolhub and ttmserver [puppet] - 10https://gerrit.wikimedia.org/r/1287428 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:15:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11922147 (10Jhancock.wm) [15:16:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1288.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:16:34] !log bearloga@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287422|EventStreamConfig: fix product_metrics.web_base (T426209)]] (duration: 06m 20s) [15:16:37] T426209: Explicitly declare absence of contextual attributes in product_metrics.web_base stream - https://phabricator.wikimedia.org/T426209 [15:18:57] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:57] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:57] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:57] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:57] PROBLEM - Swift https 
frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:57] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:57] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:58] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:58] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:18:59] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:19:01] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:19:47] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:47] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:47] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:47] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:47] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:47] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 
0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:48] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:48] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:49] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:49] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [15:19:51] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [15:20:07] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.11.2 [software/homer] - 10https://gerrit.wikimedia.org/r/1287415 (owner: 10Cathal Mooney) [15:22:13] (03CR) 10Kosta Harlan: [C:03+1] .gitignore: Add /static/hcaptcha/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287026 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [15:23:14] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1285.eqiad.wmnet with reason: host reimage [15:25:17] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [15:29:26] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1285.eqiad.wmnet with reason: host reimage [15:29:54] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Allow launching dumps from airflow-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [15:31:51] (03Merged) 10jenkins-bot: 
mediawiki-dumps-legacy: Allow launching dumps from airflow-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287411 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [15:31:56] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [15:32:38] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [15:32:39] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1284.eqiad.wmnet with OS bookworm [15:32:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1284.eqiad.wmnet with OS bookworm completed: - db1284 (**PASS**) -... 
[15:33:36] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1288.eqiad.wmnet with OS bookworm [15:33:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1288.eqiad.wmnet with OS bookworm [15:35:16] (03PS1) 10Cathal Mooney: Release v0.11.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287436 [15:35:24] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [15:37:47] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394#11922256 (10bd808) [15:40:43] (03PS2) 10Cathal Mooney: Release v0.11.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287436 [15:41:11] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1289] - vriley@cumin1003" [15:41:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1289] - vriley@cumin1003" [15:41:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:30] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1289 [15:42:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394#11922304 (10bd808) [15:42:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1289 [15:45:37] 10SRE-SLO, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Grafana: deploy grafana-dashboard-reporter-app - https://phabricator.wikimedia.org/T425795#11922323 (10hnowlan) [15:45:51] !log vriley@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [15:45:56] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:46:10] (03CR) 10Cathal Mooney: [C:03+2] Release v0.11.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1287436 (owner: 10Cathal Mooney) [15:46:47] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie [15:47:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11922328 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS tr... [15:48:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [15:48:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1285.eqiad.wmnet with OS bookworm [15:49:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1285.eqiad.wmnet with OS bookworm completed: - db1285 (**PASS**) -... 
[15:49:27] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Release v0.11.2 - cmooney@cumin1003 [15:49:37] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1288.eqiad.wmnet with reason: host reimage [15:50:05] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [15:51:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Release v0.11.2 - cmooney@cumin1003 [15:52:21] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:53:19] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:54:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1288.eqiad.wmnet with reason: host reimage [15:54:57] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1290] - vriley@cumin1003" [15:55:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1290] - vriley@cumin1003" [15:55:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:22] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1290 [15:56:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1290 [15:57:43] (03CR) 10Btullis: "Yes, I agree. I think that this will be an issue."
[puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [15:57:47] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:59:12] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:59:16] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:59:18] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1600). [16:00:05] Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:33] o/ looking [16:01:40] Dreamy_Jazz: pretty foolproof from a puppet pov :) I'm not reviewing from a "maintenance script does the right thing" perspective but I figure you did [16:01:49] will you want to kick off a test run, or just let the next one happen on schedule? 
[16:01:55] (03CR) 10RLazarus: [C:03+2] purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [16:03:59] 10SRE-swift-storage, 10Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975#11922386 (10taavi) 05Open→03Resolved a:03taavi [16:04:05] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [16:06:45] \o [16:06:51] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [16:06:52] Sorry, didn't see pings until now [16:06:57] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:07:08] Should be fine for the next one to happen on schedule [16:07:17] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove records for deleted IPs esams,drmrs and magru - cmooney@cumin1003" [16:07:29] And yes, I reviewed from a "maintenance script does the right thing" [16:07:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [16:07:34] (03PS1) 10Cathal Mooney: Remove INCLUDE statements for CR<->CR link networks no longer used [dns] - 10https://gerrit.wikimedia.org/r/1287439 (https://phabricator.wikimedia.org/T424611) [16:07:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove records for deleted IPs esams,drmrs and magru - cmooney@cumin1003" [16:07:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:50] Dreamy_Jazz: great thanks :) [16:08:10] mw-cron had some existing undeployed diffs, just checking those before I push this out
[16:09:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:11:32] aha, the diffs are https://gerrit.wikimedia.org/r/c/operations/puppet/+/1280431 [16:11:33] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:12:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [16:12:48] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [16:13:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [16:13:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1288.eqiad.wmnet with OS bookworm [16:14:04] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:14:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922442 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1288.eqiad.wmnet with OS bookworm completed: - db1288 (**PASS**) -... 
[16:14:10] (03PS1) 10Cathal Mooney: common.yaml: remove OSPF definitions for esams/drmrs/magru cr links [homer/public] - 10https://gerrit.wikimedia.org/r/1287440 (https://phabricator.wikimedia.org/T424611) [16:15:38] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:15:57] (03CR) 10Cathal Mooney: [C:03+2] common.yaml: remove OSPF definitions for esams/drmrs/magru cr links [homer/public] - 10https://gerrit.wikimedia.org/r/1287440 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [16:16:06] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:16:26] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:19] (03Merged) 10jenkins-bot: common.yaml: remove OSPF definitions for esams/drmrs/magru cr links [homer/public] - 10https://gerrit.wikimedia.org/r/1287440 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [16:17:51] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:18:57] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:18:57] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:18:57] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:18:57] PROBLEM - Swift https backend on ms-fe2017 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:18:57] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:18:59] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:19:01] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:19:07] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:19:47] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:47] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:47] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:47] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:47] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:51] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:57] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 5.756 second response time https://wikitech.wikimedia.org/wiki/Swift [16:19:57] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply 
[16:19:59] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [16:20:32] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [16:20:51] (no cronjobs in codfw but I'm deploying just to clean up the unapplied diffs in the networkpolicy ) [16:20:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922517 (10VRiley-WMF) [16:20:56] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:21:00] (03CR) 10AKhatun: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [16:21:14] !log disable core router direct link at magru now that traffic is flowing via switches T424611 [16:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:18] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [16:21:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922522 (10VRiley-WMF) [16:21:45] Dreamy_Jazz: done! [16:21:49] puppet window complete [16:21:51] Thanks [16:22:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922528 (10VRiley-WMF) Looking into the other servers to see what issues they may be. 
Suspected wrong cable ports, cable issues or mislabeled [16:22:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922533 (10VRiley-WMF) 05In progress→03Open [16:25:11] !log disable core router direct link at drmrs now that traffic is flowing via switches T424611 [16:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:46] !log disable core router direct link at esams now that traffic is flowing via switches T424611 [16:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:50] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [16:33:20] (03CR) 10Dzahn: "Thank you! there is actually a puppetserver there. but I was hoping we could stop using it and switch phab test instances back to the glo" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani) [16:34:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:24] (03CR) 10Ottomata: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [16:35:50] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1006.eqiad.wmnet with OS trixie [16:36:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11922644 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS trixie... 
[16:36:08] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [16:36:12] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [16:37:16] (03CR) 10Dzahn: "puppet compiler seems broken: Failed to execute '/pdb/query/v4' on at least 1 of the following 'server_urls': https://pcc-worker1006.pupp" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani) [16:44:14] (03CR) 10Anzx: "seems ok to keep it as it is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [16:44:17] (03CR) 10Dzahn: [C:03+2] Phabricator: require config before scap [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani) [16:48:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:38] (03PS1) 10Dzahn: phabricator::migration: require config before scap [puppet] - 10https://gerrit.wikimedia.org/r/1287447 (https://phabricator.wikimedia.org/T424055) [16:49:11] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [16:49:16] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [16:49:17] (03CR) 10Dzahn: [C:03+2] "doing the same thing in the phabricator::migration class which we made to allow setting up new prod phabricator servers: https://gerrit.w" [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani) [16:53:07] (03CR) 10Ottomata: "Hm, nice! 
Could we do that to produce it to both $dc.$meta.stream and to rsyslog-$severity? Doing so would get the data in logstash as w" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [16:53:10] (03CR) 10Dzahn: [C:04-2] "probably breaks puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1287447 (https://phabricator.wikimedia.org/T424055) (owner: 10Dzahn) [16:53:26] (03CR) 10Dzahn: [C:03+2] "unfortunately: nope. dependency cycle." [puppet] - 10https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055) (owner: 10Thcipriani) [16:53:59] (03PS1) 10Dzahn: Revert "Phabricator: require config before scap" [puppet] - 10https://gerrit.wikimedia.org/r/1287450 [16:56:34] (03CR) 10Dzahn: [C:03+2] Revert "Phabricator: require config before scap" [puppet] - 10https://gerrit.wikimedia.org/r/1287450 (owner: 10Dzahn) [16:57:52] (03CR) 10Ottomata: "I was looking for prior art here. I recall that the `mediawiki.client.error` stream is produced to kafka logging clusters by eventgate-lo" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [16:58:06] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T426298 [17:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1700) [17:03:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:44] * bd808 checks for deployable increments [17:06:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343 (10cmooney) 03NEW p:05Triage→03Medium [17:06:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11922831 (10cmooney) [17:06:58] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T426298 [17:08:19] (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-05-11-122319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287452 [17:08:42] (03CR) 10Cathal Mooney: [C:03+2] Remove INCLUDE statements for CR<->CR link networks no longer used [dns] - 10https://gerrit.wikimedia.org/r/1287439 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [17:09:11] !log cmooney@dns2005 START - running authdns-update [17:10:27] !log cmooney@dns2005 END - running authdns-update [17:11:17] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-05-11-122319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287452 (owner: 10BryanDavis) [17:13:30] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-05-11-122319-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1287452 (owner: 10BryanDavis) [17:14:01] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T426298 [17:14:19] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:21] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:15:58] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:16:13] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:16:37] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:16:53] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:17:03] 06SRE, 10corto, 10Incident Tooling, 13Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11922867 (10hnowlan) Some initial context: The kinds of issues SRE are dealing with have changed significantly in the last ~year. Historicall... [17:17:26] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:17:47] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:17:48] (03PS2) 10Dzahn: zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999 [17:18:52] That's all for my window. 
[17:19:01] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:19:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:19:50] (03CR) 10Dzahn: "Currently the user has a home dir /home/zuul but it does not exist:" [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (owner: 10Dzahn) [17:19:51] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Swift [17:20:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [17:21:11] (03CR) 10Dzahn: "we should talk to traffic about the port question" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [17:23:03] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T426298 [17:24:34] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1285488/8560/codesearch9.codesearch.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [17:25:42] (03CR) 10Dzahn: [V:03+1] "let me know if you think the general idea is good/acceptable. then I will be bold to just merge it, no need to check the code details." 
[puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [17:26:18] (03PS6) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [17:26:23] (03PS7) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [17:26:39] (03CR) 10Dzahn: [V:03+1] "for the future person seeing this: please do not punish me for having touched it last. just trying to be helpful with one particular incid" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [17:27:30] (03PS8) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [17:27:30] (03CR) 10Jforrester: wikifunctions: Add releases function-evaluators in Rust, unused (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [17:28:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11922906 (10Dzahn) 05In progress→03Stalled [17:28:48] (03CR) 10Jforrester: [C:03+2] wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [17:28:54] (03Abandoned) 10Dzahn: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: 10Dzahn) 
[17:29:53] (03CR) 10Dzahn: "stalled - waiting for feedback - moving back to WIP status" [puppet] - 10https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) [17:30:58] (03Merged) 10jenkins-bot: wikifunctions: Add releases function-evaluators in Rust, unused [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [17:31:14] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:32:06] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:38:40] (03CR) 10BPirkle: "Ack" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [18:00:05] andre and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T1800). 
[18:08:31] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:09:04] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1274.eqiad.wmnet with OS bookworm [18:09:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11922998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm [18:10:43] (03PS5) 10Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T415930) (owner: 10HakanIST) [18:14:08] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:16:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:22] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:19:07] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [18:19:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [18:19:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [18:19:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [18:19:57] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [18:20:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second 
response time https://wikitech.wikimedia.org/wiki/Swift [18:20:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [18:20:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [18:22:13] vriley@cumin1003 provision (PID 4001665) is awaiting input [18:23:30] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [18:25:02] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1274.eqiad.wmnet with reason: host reimage [18:25:59] (03PS1) 10Andrew Bogott: magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457 [18:26:28] (03PS1) 10Jdlrobson: Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) [18:27:54] (03PS2) 10Andrew Bogott: magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457 [18:29:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1274.eqiad.wmnet with reason: host reimage [18:29:55] (03CR) 10Andrew Bogott: [V:03+2] magnum-cluster-api: try the latest version of container-api on the worker [puppet] 
- 10https://gerrit.wikimedia.org/r/1287457 (owner: 10Andrew Bogott) [18:30:07] (03CR) 10Andrew Bogott: [C:03+2] magnum-cluster-api: try the latest version of container-api on the worker [puppet] - 10https://gerrit.wikimedia.org/r/1287457 (owner: 10Andrew Bogott) [18:32:32] vriley@cumin1003 provision (PID 4001665) is awaiting input [18:33:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923052 (10VRiley-WMF) [18:36:53] 10SRE-Access-Requests, 06Data-Platform-SRE: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11923056 (10Ottomata) [18:38:20] 10SRE-Access-Requests, 06Data-Platform-SRE: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11923058 (10Ahoelzl) approved. [18:40:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1281.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:46:50] 10SRE-SLO, 10observability, 10Wikidata, 06Wikidata Platform Team, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11923094 (10bking) [18:47:05] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:50:10] vriley@cumin1003 reimage (PID 4001255) is awaiting input [18:51:03] (03CR) 10RLazarus: wikifunctions: Add releases function-evaluators in Rust, unused (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [18:51:44] (03PS1) 10RLazarus: wikifunctions: Remove noop OTEL_EXPORTER_OTLP_ENDPOINT from releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287465 (https://phabricator.wikimedia.org/T423627) [18:53:57] (03CR) 10Jforrester: 
[C:03+2] wikifunctions: Remove noop OTEL_EXPORTER_OTLP_ENDPOINT from releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287465 (https://phabricator.wikimedia.org/T423627) (owner: 10RLazarus) [18:56:10] (03Merged) 10jenkins-bot: wikifunctions: Remove noop OTEL_EXPORTER_OTLP_ENDPOINT from releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287465 (https://phabricator.wikimedia.org/T423627) (owner: 10RLazarus) [18:57:57] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:58:00] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:58:59] (03CR) 10Jforrester: [C:03+2] wikifunctions: Add releases function-evaluators in Rust, unused (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [19:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:50] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1281.eqiad.wmnet with OS bookworm [19:06:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923131 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1281.eqiad.wmnet with OS bookworm [19:07:16] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos) [19:07:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923136 
(10VRiley-WMF) [19:09:22] (03Merged) 10jenkins-bot: ml-services: add qwen36-27b to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287362 (https://phabricator.wikimedia.org/T425680) (owner: 10Ilias Sarantopoulos) [19:14:58] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:14:59] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1274.eqiad.wmnet with OS bookworm [19:15:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923140 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm completed: - db1274 (**PASS**) -... [19:19:43] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:19:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:19:57] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:19:57] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:20:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [19:20:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [19:20:47] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time 
https://wikitech.wikimedia.org/wiki/Swift [19:20:47] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [19:22:30] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:23:07] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1281.eqiad.wmnet with reason: host reimage [19:24:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [19:26:25] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1286] - vriley@cumin1003" [19:26:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1286] - vriley@cumin1003" [19:26:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:26:46] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1286 [19:28:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1286 [19:28:34] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1286.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:29:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1281.eqiad.wmnet with reason: host reimage [19:38:13] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:45:47] !log vriley@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:46:22] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1286.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:48:52] vriley@cumin1003 reimage (PID 4009498) is awaiting input [19:49:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:49:14] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1281.eqiad.wmnet with OS bookworm [19:49:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923219 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1281.eqiad.wmnet with OS bookworm completed: - db1281 (**PASS**) -... [19:53:23] (03CR) 10Bking: [C:03+1] IPReputation: Route opensearch_ipoid through envoy service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286804 (https://phabricator.wikimedia.org/T421293) (owner: 10Kosta Harlan) [19:56:16] 9 [19:56:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1287.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T2000). [20:00:05] JSherman, stephanebisson, codenamenoreste, and Neriah: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:26] o/ ready to go and can self deploy if needed [20:01:14] hi :) [20:02:24] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1286.eqiad.wmnet with OS bookworm [20:02:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923248 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1286.eqiad.wmnet with OS bookworm [20:03:19] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [20:04:15] It's looking like all config patches today. I was planning on rolling my 3 together as they are pretty straightforward. Does anybody want to hitch a ride on that deploy? [20:05:17] ^^ [20:05:41] for anyone who misses the ride, i can deploy for whoever needs - just ping [20:06:55] cjming: thanks! I'm showing 5 after now, so I'll get started. 
[20:07:07] sounds good [20:07:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: 10Kgraessle) [20:07:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) (owner: 10Jsn.sherman) [20:07:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) (owner: 10Jsn.sherman) [20:08:14] (03Merged) 10jenkins-bot: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: 10Kgraessle) [20:08:18] (03Merged) 10jenkins-bot: Enable AutoModerator on Albanian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) (owner: 10Jsn.sherman) [20:08:22] (03Merged) 10jenkins-bot: Enable AutoModerator on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) (owner: 10Jsn.sherman) [20:10:33] looks like the deploy timed out on the rebase, which succeeded; retrying [20:11:08] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1192921|Enable AutoModerator on Italian Wikipedia (T405152)]], [[gerrit:1286974|Enable AutoModerator on Albanian Wikipedia (T420450)]], [[gerrit:1286975|Enable AutoModerator on Dutch Wikipedia (T425509)]] [20:11:15] T405152: Enable AutoModerator on Italian Wikipedia - https://phabricator.wikimedia.org/T405152 [20:11:16] T420450: Enable AutoModerator on Albanian Wikipedia - https://phabricator.wikimedia.org/T420450 [20:11:16] T425509: Enable AutoModerator on Dutch Wikipedia (nlwiki) - 
https://phabricator.wikimedia.org/T425509 [20:11:46] love to see it: `0 languages rebuilt out of 549` [20:12:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923273 (10Papaul) [20:13:03] !log jsn@deploy1003 kgraessle, jsn: Backport for [[gerrit:1192921|Enable AutoModerator on Italian Wikipedia (T405152)]], [[gerrit:1286974|Enable AutoModerator on Albanian Wikipedia (T420450)]], [[gerrit:1286975|Enable AutoModerator on Dutch Wikipedia (T425509)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:13:33] testing [20:14:39] !log jsn@deploy1003 kgraessle, jsn: Continuing with deployment [20:18:17] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1286.eqiad.wmnet with reason: host reimage [20:18:57] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1192921|Enable AutoModerator on Italian Wikipedia (T405152)]], [[gerrit:1286974|Enable AutoModerator on Albanian Wikipedia (T420450)]], [[gerrit:1286975|Enable AutoModerator on Dutch Wikipedia (T425509)]] (duration: 07m 48s) [20:19:04] T405152: Enable AutoModerator on Italian Wikipedia - https://phabricator.wikimedia.org/T405152 [20:19:04] T420450: Enable AutoModerator on Albanian Wikipedia - https://phabricator.wikimedia.org/T420450 [20:19:04] T425509: Enable AutoModerator on Dutch Wikipedia (nlwiki) - https://phabricator.wikimedia.org/T425509 [20:19:05] cjming: all yours! [20:19:21] JSherman: thanks! 
[20:19:40] JSerman: Coming soon: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1187 [20:19:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [20:19:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [20:19:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [20:19:46] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1287.eqiad.wmnet with OS bookworm [20:19:52] oops: JSherman: ^^ [20:19:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1287.eqiad.wmnet with OS bookworm [20:19:59] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [20:20:01] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [20:20:07] stephanebisson: are you around? [20:20:25] codenamenoreste: are you around? 
[20:20:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [20:20:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [20:20:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [20:20:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923292 (10VRiley-WMF) [20:20:49] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [20:20:51] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [20:21:07] Neriah: I think you're here? 
[20:21:13] ya [20:21:21] ok i'll do yours next [20:21:26] dancy: 🎉that's awesome!🎉 [20:21:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923293 (10VRiley-WMF) [20:22:01] (03PS3) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) [20:22:31] (03CR) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [20:23:27] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1286.eqiad.wmnet with reason: host reimage [20:24:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [20:24:58] (03Merged) 10jenkins-bot: Disable wgNewUserMessageOnAutoCreate on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [20:25:15] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1287002|Disable wgNewUserMessageOnAutoCreate on all WMF wikis (T426206)]] [20:25:19] T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206 [20:27:05] !log cjming@deploy1003 cjming, neriah: Backport for [[gerrit:1287002|Disable wgNewUserMessageOnAutoCreate on all WMF wikis (T426206)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:27:14] testing [20:28:52] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1289.eqiad.wmnet with OS bookworm [20:29:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923323 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1289.eqiad.wmnet with OS bookworm [20:29:10] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1289.eqiad.wmnet with OS bookworm [20:29:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923324 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1289.eqiad.wmnet with OS bookworm executed with errors: - db1289 (**F... [20:29:46] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:31:05] looks good [20:31:17] cjming: you can continue [20:31:26] !log cjming@deploy1003 cjming, neriah: Continuing with deployment [20:31:33] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:35:33] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287002|Disable wgNewUserMessageOnAutoCreate on all WMF wikis (T426206)]] (duration: 10m 18s) [20:35:37] T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206 [20:35:39] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1287.eqiad.wmnet with reason: host reimage [20:35:52] thanks :) [20:35:54] yw! 
[20:36:07] if anyone else shows up for the window and needs a deployer, please ping me [20:38:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1287.eqiad.wmnet with reason: host reimage [20:39:17] (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: add qwen36-27b to experimental ns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287480 [20:40:27] (03CR) 10Ilias Sarantopoulos: [C:03+2] Revert "ml-services: add qwen36-27b to experimental ns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287480 (owner: 10Ilias Sarantopoulos) [20:40:34] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:41:12] Where are we in the backport window? [20:41:44] stephanebisson: do you need a deployer for your patch? [20:42:04] I can do it, is it my turn? [20:42:04] you can self-deploy or i'm happy to deploy for you [20:42:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923368 (10Papaul) [20:42:10] sure - go for it [20:42:13] Thanks [20:42:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson) [20:42:37] (03Merged) 10jenkins-bot: Revert "ml-services: add qwen36-27b to experimental ns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287480 (owner: 10Ilias Sarantopoulos) [20:43:15] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[20:43:28] (03Merged) 10jenkins-bot: Simplewiki: include article wizard in AG experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287427 (https://phabricator.wikimedia.org/T426278) (owner: 10Sbisson) [20:43:39] vriley@cumin1003 reimage (PID 4018159) is awaiting input [20:43:41] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287427|Simplewiki: include article wizard in AG experiment (T426278)]] [20:43:43] (03PS1) 10Bking: cirrussearch: Add server depool metadata [puppet] - 10https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) [20:43:45] T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278 [20:45:30] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1287427|Simplewiki: include article wizard in AG experiment (T426278)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:46:35] !log sbisson@deploy1003 sbisson: Continuing with deployment [20:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:50:44] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287427|Simplewiki: include article wizard in AG experiment (T426278)]] (duration: 07m 03s) [20:50:47] T426278: Enable Article Guidance experiment on Simple English Wikipedia (simplewiki) - https://phabricator.wikimedia.org/T426278 [20:52:02] I'm done [20:54:06] (03CR) 10Dzahn: [C:04-1] "for now let's stick to the simple control of each service being present or absent in Hiera, per DC (or per host)" [puppet] - 10https://gerrit.wikimedia.org/r/1287035 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:54:38] (03PS3) 10Seddon: Enable hCaptcha for account 
creation API on group 0 wiki's [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) [20:55:56] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:56:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:56:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1287.eqiad.wmnet with OS bookworm [20:56:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923405 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1287.eqiad.wmnet with OS bookworm completed: - db1287 (**PASS**) -... [20:57:26] (03PS2) 10Ryan Kemper: cirrussearch: Add server depool metadata [puppet] - 10https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) (owner: 10Bking) [20:57:46] (03CR) 10Ryan Kemper: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) (owner: 10Bking) [21:00:04] (03PS1) 10Dzahn: zuul: disable all services in codfw, keep enabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1287483 (https://phabricator.wikimedia.org/T395938) [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T2100) [21:02:12] (03Abandoned) 10Dzahn: zuul: make all service_ensures dependent on a single active server [puppet] - 10https://gerrit.wikimedia.org/r/1287035 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:02:56] (03CR) 10Dreamy Jazz: [C:03+1] Enable hCaptcha for account creation API on group 0 wiki's [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) (owner: 10Seddon) [21:03:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:17] jouncebot: nowandnext [21:04:18] For the next 0 hour(s) and 55 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260514T2100) [21:04:18] In 8 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260515T0600) [21:04:29] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1287483/8561/" [puppet] - 10https://gerrit.wikimedia.org/r/1287483 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:06:54] (03PS1) 10Dreamy Jazz: Remove DynamicPageList from legalteamwiki as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287484 [21:07:35] Going to use scap shortly [21:08:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) (owner: 10Seddon) [21:08:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287484 (owner: 10Dreamy Jazz) [21:09:43] (03Merged) 10jenkins-bot: Enable hCaptcha for account creation API on group 0 wiki's [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287479 (https://phabricator.wikimedia.org/T426043) (owner: 10Seddon) [21:09:46] (03Merged) 10jenkins-bot: Remove DynamicPageList from legalteamwiki as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287484 (owner: 10Dreamy Jazz) [21:10:02] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1287479|Enable hCaptcha for account creation API on group 0 wiki's]], [[gerrit:1287484|Remove DynamicPageList from legalteamwiki as unused]] [21:11:49] !log dreamyjazz@deploy1003 dreamyjazz, seddon: Backport for [[gerrit:1287479|Enable hCaptcha for account creation API on group 0 wiki's]], [[gerrit:1287484|Remove DynamicPageList from legalteamwiki as unused]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:12:26] !log dreamyjazz@deploy1003 dreamyjazz, seddon: Continuing with deployment [21:12:57] (03PS1) 10Jdrewniak: Disable Reading Lists survey for Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287485 [21:15:08] !log vriley@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:15:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1286.eqiad.wmnet with OS bookworm [21:15:17] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11923442 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1286.eqiad.wmnet with OS bookworm completed: - db1286 (**WARN**) -... 
[21:15:27] (03PS2) 10Jdrewniak: Disable Reading Lists survey for Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) [21:16:35] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287479|Enable hCaptcha for account creation API on group 0 wiki's]], [[gerrit:1287484|Remove DynamicPageList from legalteamwiki as unused]] (duration: 06m 33s) [21:16:42] Finished with scap [21:17:37] (03CR) 10Anne Tomasevich: [C:03+1] Disable Reading Lists survey for Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: 10Jdrewniak) [21:17:50] (03CR) 10Bking: [C:03+2] cirrussearch: Add server depool metadata [puppet] - 10https://gerrit.wikimedia.org/r/1287481 (https://phabricator.wikimedia.org/T327300) (owner: 10Bking) [21:18:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:19:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: 10Jdrewniak) [21:19:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:19:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:19:59] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:19:59] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [21:20:01] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:20:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [21:20:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [21:20:49] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [21:20:51] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Swift [21:20:51] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [21:23:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: 10Jdrewniak) [21:24:19] (03Merged) 10jenkins-bot: Disable Reading Lists survey for Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287485 (https://phabricator.wikimedia.org/T421776) (owner: 10Jdrewniak) [21:24:33] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1287485|Disable Reading Lists survey for Wikipedias (T421776)]] [21:24:37] T421776: Enable the beta feature survey - https://phabricator.wikimedia.org/T421776 [21:26:21] !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1287485|Disable Reading Lists survey for Wikipedias (T421776)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:28:05] When Jan is finished, I have one final readers patch during today's window that I plan to backport [21:28:30] RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:29:38] !log jdrewniak@deploy1003 jdrewniak: Continuing with deployment [21:33:48] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287485|Disable Reading Lists survey for Wikipedias (T421776)]] (duration: 09m 15s) [21:33:52] T421776: Enable the beta feature survey - https://phabricator.wikimedia.org/T421776 [21:34:43] (03CR) 10Cwhite: Configure nginx to log requests in ECS format to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [21:35:14] (03PS1) 10Eric Gardner: Share Highlight: overdraw photo on share card canvas [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287488 (https://phabricator.wikimedia.org/T426344) [21:38:16] (03CR) 10Cwhite: [C:03+1] "Recommend holding until the nginx config change is merged and verifying the nginx output logs' ECS compatibility with https://doc.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [21:38:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by egardner@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287488 (https://phabricator.wikimedia.org/T426344) (owner: 10Eric Gardner) [21:38:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:39:49] (03Merged) 10jenkins-bot: Share Highlight: overdraw photo on share card canvas [extensions/ReaderExperiments] 
(wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287488 (https://phabricator.wikimedia.org/T426344) (owner: 10Eric Gardner) [21:40:05] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1287488|Share Highlight: overdraw photo on share card canvas (T426344)]] [21:40:09] T426344: [Share Highlights] Image is not showing in the Share card on certain clients - https://phabricator.wikimedia.org/T426344 [21:41:51] !log egardner@deploy1003 egardner: Backport for [[gerrit:1287488|Share Highlight: overdraw photo on share card canvas (T426344)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:43:11] !log egardner@deploy1003 egardner: Continuing with deployment [21:47:19] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287488|Share Highlight: overdraw photo on share card canvas (T426344)]] (duration: 07m 14s) [21:47:23] T426344: [Share Highlights] Image is not showing in the Share card on certain clients - https://phabricator.wikimedia.org/T426344 [21:53:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:11:40] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: disable all services in codfw, keep enabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1287483 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:13:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:19:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:19:43] PROBLEM - Swift https backend on 
ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:19:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:19:57] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:19:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:01] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:01] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:01] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:01] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:01] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:01] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:03] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:05] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:20:33] RECOVERY - Swift https backend on ms-fe2013 is 
OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:47] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:51] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:51] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:51] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:51] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:51] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:52] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:55] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time 
https://wikitech.wikimedia.org/wiki/Swift [22:20:55] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [22:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [22:33:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:45:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923791 (10Papaul) [22:46:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11923792 (10Papaul) 05Open→03Resolved This is complete. 
thanks to @Jhancock.wm and @Jgreen [22:48:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:55:54] 06SRE, 10corto, 10Incident Tooling, 13Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11923800 (10Novem_Linguae) Thanks for working on this and for the quick and thorough replies. I appreciate it. > For the most part, the wiki... [23:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:07:17] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [23:09:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:10:05] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:11:15] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1289 [23:12:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1289 [23:13:34] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for 
host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:14:45] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:19:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:43] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:57] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:59] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:19:59] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:03] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:03] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:03] 
PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:03] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:05] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:05] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:20:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:47] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:49] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:49] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:51] RECOVERY - Swift https frontend on 
ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:53] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:55] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [23:20:55] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [23:24:19] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:26:18] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:27:28] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set 
policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:30:01] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:31:35] (03PS1) 10Thcipriani: phabricator::migration: require config before scap [puppet] - 10https://gerrit.wikimedia.org/r/1287447 (https://phabricator.wikimedia.org/T424055) (owner: 10Dzahn) [23:33:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:34:49] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:38:55] vriley@cumin1003 provision (PID 4044236) is awaiting input [23:39:08] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:40:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1287498 [23:40:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1287498 (owner: 10TrainBranchBot) [23:49:49] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:51:09] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1287498 (owner: 10TrainBranchBot) [23:53:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:53:51] vriley@cumin1003 provision (PID 4044814) is awaiting input [23:54:04] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:55:48] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1290 [23:57:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1290 [23:57:24] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:58:55] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:59:21] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED