[00:01:24] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:01:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:01:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2006.codfw.wmnet with OS trixie
[00:02:01] SRE, ServiceOps new, ServiceOps-Upgrades-Hardware, Kubernetes, Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12027390 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wikikube-ctrl2006.codf...
[00:13:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[00:14:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[00:30:35] (PS2) RLazarus: tox: Bump flake8 to 7.3.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302989
[00:30:35] (PS2) RLazarus: tox: Test up to Python 3.14 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302990
[00:30:35] (PS2) RLazarus: Release 4.0.5 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302991
[00:39:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[00:40:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[00:53:54] (PS1) Jasmine: Add Kubernetes POD IP reverse range delegations for wikikube-ctrl1005 [dns] - https://gerrit.wikimedia.org/r/1302996
(https://phabricator.wikimedia.org/T418920)
[01:00:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[01:12:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302997
[01:12:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302997 (owner: TrainBranchBot)
[01:17:41] (CR) ArielGlenn: rest-gateway: emit 401 if rate limit is 0 (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) (owner: Daniel Kinzler)
[01:19:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[01:20:36] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302997 (owner: TrainBranchBot)
[01:30:06] (PS1) DDesouza: Add English Wikipedia Mobile App Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1302998 (https://phabricator.wikimedia.org/T428876)
[01:31:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1302998 (https://phabricator.wikimedia.org/T428876) (owner: DDesouza)
[01:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:52:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:01:26] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:35] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:01] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 35s)
[02:10:41] (CR) Scott French: [C:+1] fundraising_data_import maintenance script wrapper & timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[02:12:35] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:03] (PS3) RLazarus: Release 4.0.5 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302991
[02:15:03] (PS1) RLazarus: builder: Fix type error and unpin mypy version [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302999
[02:16:11] FIRING: [5x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:21:11] FIRING: [5x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status -
https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:39:49] (PS1) Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986)
[02:41:37] (CR) CI reject: [V:-1] Add phabricator api token for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[02:41:58] (PS2) Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986)
[02:42:28] (PS1) BPirkle: REST: Adjust key of Reading Lists OpenAPI spec in RestSandboxSpecs [mediawiki-config] - https://gerrit.wikimedia.org/r/1303004 (https://phabricator.wikimedia.org/T422771)
[02:43:47] (CR) CI reject: [V:-1] Add phabricator api token for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[02:50:07] (PS3) Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986)
[02:50:57] (PS4) Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986)
[02:52:45] (PS5) Clare Ming: Add Phabricator specific configuration for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986)
[02:52:50] (CR) CI reject: [V:-1] Add Phabricator specific configuration for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[02:54:33] (CR) CI reject: [V:-1] Add Phabricator specific configuration for Test Kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1303003
(https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[02:55:47] SRE-Access-Requests: Change SSH key for denisse after new laptop provissioning - https://phabricator.wikimedia.org/T429429 (andrea.denisse) NEW
[02:56:43] (PS1) Krinkle: Disable ShortUrl on mrwiki, newiki, pawiki, tawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1303006 (https://phabricator.wikimedia.org/T107188)
[03:05:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:06:12] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:06:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:07:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:11:12] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:11:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:21:20] SRE-Access-Requests, Patch-For-Review: Change SSH key for denisse after new laptop provissioning - https://phabricator.wikimedia.org/T429429#12027519 (andrea.denisse)
[03:27:37] (PS3) Abijeet Patro: ULS rewrite: Lock body scroll when open on mobile [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) -
https://gerrit.wikimedia.org/r/1302739
[03:28:26] (PS1) Abijeet Patro: ULS rewrite: Capture trigger element before async module load [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145)
[03:28:59] (PS1) Abijeet Patro: ULS rewrite: Show variants even when no languages are available [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303010 (https://phabricator.wikimedia.org/T426532)
[03:45:06] (PS2) Abijeet Patro: Enable ULS v2 on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1303012
[03:45:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1303012 (owner: Abijeet Patro)
[03:46:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303010 (https://phabricator.wikimedia.org/T426532) (owner: Abijeet Patro)
[03:47:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145) (owner: Abijeet Patro)
[03:51:24] (CR) Scott French: [C:+1] test_cli: Update assertEquals to assertEqual [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302988 (owner: RLazarus)
[03:51:26] (CR) Scott French: [C:+1] tox: Bump flake8 to 7.3.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302989 (owner: RLazarus)
[03:51:28] (CR) Scott
French: [C:+1] tox: Test up to Python 3.14 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302990 (owner: RLazarus)
[03:51:31] (CR) Scott French: [C:+1] builder: Fix type error and unpin mypy version [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302999 (owner: RLazarus)
[03:51:40] (CR) Scott French: [C:+1] Release 4.0.5 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1302991 (owner: RLazarus)
[05:02:43] (CR) WMDE-Fisch: "Thanks a lot 🙏" [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302872 (https://phabricator.wikimedia.org/T428764) (owner: WMDE-Fisch)
[05:03:08] (Abandoned) WMDE-Fisch: Fix VE core submodule update to 3e79e9934 [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302872 (https://phabricator.wikimedia.org/T428764) (owner: WMDE-Fisch)
[05:07:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for laurabarluzzi - https://phabricator.wikimedia.org/T429431 (Laurabarluzzi) NEW
[05:19:38] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[05:19:57] (CR) Marostegui: [C:+2] mysql-gtid.yaml: Add pint [alerts] - https://gerrit.wikimedia.org/r/1302724 (https://phabricator.wikimedia.org/T427469) (owner: Marostegui)
[05:23:09] SRE, observability, SRE Observability, Patch-For-Review: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12027616 (Marostegui) Merged - let's give it sometime to see if this fixes the problem.
[05:32:06] ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12027626 (Marostegui) @VRiley-WMF did you have some time to check if any of the above hosts could work to get some pieces to replace the ones failing on db1224? Thanks!
[05:37:10] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:37:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1037: Upgrading es1037.eqiad.wmnet
[05:37:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1037: Upgrading es1037.eqiad.wmnet
[05:38:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1037.eqiad.wmnet with OS trixie
[05:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:52:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1037.eqiad.wmnet with reason: host reimage
[05:59:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1037.eqiad.wmnet with reason: host reimage
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T0600)
[06:04:44] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348#12027675 (Marostegui) p:Triage→Medium
[06:07:35] FIRING: [4x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:41] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:14:41] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:15:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:16:11] FIRING: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:16:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1037.eqiad.wmnet with OS trixie
[06:20:05] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12027720 (Marostegui) @Jclark-ctr what do you think of the above? worth investigating or should I repool?
[06:20:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:28:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1037: Migration of es1037.eqiad.wmnet completed
[06:29:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[06:31:43] SRE, Infrastructure-Foundations: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12027726 (MoritzMuehlenhoff) @jcrespo Cumin is now working on cumin2003, you can test backups.
[06:32:32] (PS1) Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1303286 (https://phabricator.wikimedia.org/T429436)
[06:32:38] (PS1) Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1303287 (https://phabricator.wikimedia.org/T429436)
[06:35:46] (PS1) Marostegui: db-production.php: Disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1303288 (https://phabricator.wikimedia.org/T429118)
[06:43:59] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:44:08] (PS1) Giuseppe Lavagetto: Re-release changes introduced previously, plus use a trie to aggregate ipblocks [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1303289
[06:44:26] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Re-release changes introduced previously, plus use a trie to aggregate ipblocks [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1303289 (owner: Giuseppe Lavagetto)
[06:44:36] (CR) Elukey: [C:+2] preseed: fix partman config for the new conf2* hosts [puppet] - https://gerrit.wikimedia.org/r/1302921 (https://phabricator.wikimedia.org/T418914) (owner: Elukey)
[06:44:43] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[06:46:01] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Haproxy provenance maps in HP; UX changes - oblivian@cumin1003"
[06:46:03] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Haproxy provenance maps in HP; UX changes - oblivian@cumin1003
[06:46:54] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Haproxy provenance maps in HP; UX changes - oblivian@cumin1003
[06:46:55] !log oblivian@cumin1003 END (PASS) -
Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Haproxy provenance maps in HP; UX changes - oblivian@cumin1003"
[06:48:22] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-b7-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:50:19] (PS1) Marostegui: Revert "es2045: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1303290
[06:51:28] (PS1) Giuseppe Lavagetto: Revert "Re-release changes introduced previously, plus use a trie to aggregate ipblocks" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1303291
[06:51:51] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Revert "Re-release changes introduced previously, plus use a trie to aggregate ipblocks" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1303291 (owner: Giuseppe Lavagetto)
[06:52:18] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "revert deployment - oblivian@cumin1003"
[06:52:19] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: revert deployment - oblivian@cumin1003
[06:53:12] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: revert deployment - oblivian@cumin1003
[06:53:13] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "revert deployment - oblivian@cumin1003"
[06:53:40] (CR) Muehlenhoff: [C:+2] Remove the black box check for mirrors.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1302858
(https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[06:55:43] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[06:56:01] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.88 ms
[06:58:22] RESOLVED: CertAlmostExpired: gNMI TLS certificate for lsw1-b7-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:59:20] (CR) Elukey: sre.hosts.provision: introduce the wmfroot user (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[07:00:01] (CR) Nikerabbit: [C:+1] ULS rewrite: Capture trigger element before async module load [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145) (owner: Abijeet Patro)
[07:00:05] Amir1, urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T0700). Please do the needful.
[07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:33] (CR) Nikerabbit: [C:+1] ULS rewrite: Show variants even when no languages are available [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303010 (https://phabricator.wikimedia.org/T426532) (owner: Abijeet Patro)
[07:00:47] (CR) Nikerabbit: [C:+1] Enable ULS v2 on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1303012 (owner: Abijeet Patro)
[07:02:00] o/
[07:04:36] (PS7) Elukey: sre.hosts.provision: introduce the wmfroot user [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180)
[07:04:59] (CR) Elukey: sre.hosts.provision: introduce the wmfroot user (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[07:12:15] (CR) Nikerabbit: [C:+1] ULS rewrite: Lock body scroll when open on mobile [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302739 (owner: Abijeet Patro)
[07:12:34] (CR) Nikerabbit: [C:+1] ULS rewrite: Fix settings dialog width and field sizing [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302743 (https://phabricator.wikimedia.org/T416512) (owner: Abijeet Patro)
[07:14:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1037: Migration of es1037.eqiad.wmnet completed
[07:14:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[07:15:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[07:15:42] (CR) JMeybohm: [C:+1] "Nice!"
[debs/helm3] - https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) (owner: Jelto)
[07:15:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1044: Upgrading es1044.eqiad.wmnet
[07:16:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1044: Upgrading es1044.eqiad.wmnet
[07:17:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1044.eqiad.wmnet with OS trixie
[07:20:11] (PS1) Giuseppe Lavagetto: Code changes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1303294
[07:20:30] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Code changes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1303294 (owner: Giuseppe Lavagetto)
[07:21:41] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Haproxy provenance maps in HP; UX changes (attempt 3) - oblivian@cumin1003"
[07:21:43] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Haproxy provenance maps in HP; UX changes (attempt 3) - oblivian@cumin1003
[07:21:50] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[07:22:31] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Haproxy provenance maps in HP; UX changes (attempt 3) - oblivian@cumin1003
[07:22:32] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Haproxy provenance maps in HP; UX changes (attempt 3) - oblivian@cumin1003"
[07:22:57] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[07:23:23] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host conf2007.codfw.wmnet with OS trixie
[07:23:51] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[07:25:45] (PS1) Matthias Mullie: Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295
[07:26:10] (PS1) Matthias Mullie: Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303296
[07:27:29] (CR) CI reject: [V:-1] Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295 (owner: Matthias Mullie)
[07:30:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:32:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1044.eqiad.wmnet with reason: host reimage
[07:33:35] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8747/co" [puppet] - https://gerrit.wikimedia.org/r/1299939 (https://phabricator.wikimedia.org/T422249) (owner: Giuseppe Lavagetto)
[07:34:46] (CR) Giuseppe Lavagetto: [V:+1 C:+2] haproxy: get ipblock map directly from HP [puppet] - https://gerrit.wikimedia.org/r/1299939 (https://phabricator.wikimedia.org/T422249) (owner: Giuseppe Lavagetto)
[07:35:07] (CR) Matthias Mullie: "recheck" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295 (owner: Matthias Mullie)
[07:39:40] (PS2) Matthias Mullie: Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295
[07:39:41] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2007.codfw.wmnet with reason:
host reimage
[07:40:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1044.eqiad.wmnet with reason: host reimage
[07:40:48] (CR) CI reject: [V:-1] Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295 (owner: Matthias Mullie)
[07:41:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:42:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host conf2008.codfw.wmnet with OS trixie
[07:43:19] (PS3) Matthias Mullie: Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295
[07:43:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply
[07:43:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply
[07:43:55] (CR) Arnaudb: [C:+2] ci: monitor a wider variety of network errors [puppet] - https://gerrit.wikimedia.org/r/1302829 (https://phabricator.wikimedia.org/T420865) (owner: Arnaudb)
[07:44:38] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2007.codfw.wmnet with reason: host reimage
[07:46:27] noone deploying atm?
I'll use the remainder of this slot if that's alright :)
[07:46:35] (CR) Filippo Giunchedi: [C:+1] ceph: allow to set client transport encryption [puppet] - https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) (owner: Volans)
[07:47:12] (CR) TrainBranchBot: [C:+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303296 (owner: Matthias Mullie)
[07:47:12] (CR) TrainBranchBot: [C:+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295 (owner: Matthias Mullie)
[07:47:33] (PS1) Giuseppe Lavagetto: cache::haproxy: fix reference in template [puppet] - https://gerrit.wikimedia.org/r/1303300
[07:47:45] (CR) Filippo Giunchedi: [C:+1] Cinder backups: enable transport encryption part 1 [puppet] - https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) (owner: Volans)
[07:47:50] (CR) Giuseppe Lavagetto: [V:+2 C:+2] cache::haproxy: fix reference in template [puppet] - https://gerrit.wikimedia.org/r/1303300 (owner: Giuseppe Lavagetto)
[07:48:21] (Merged) jenkins-bot: Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303296 (owner: Matthias Mullie)
[07:48:22] (Merged) jenkins-bot: Squashed diff to master [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303295 (owner: Matthias Mullie)
[07:49:51] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1303296|Squashed diff to master]], [[gerrit:1303295|Squashed diff to master]]
[07:50:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:51:31] (PS1) Slyngshede: C:apereo_cas: Script for resetting webauthn device registration [puppet] -
10https://gerrit.wikimedia.org/r/1303304 [07:52:04] (03CR) 10CI reject: [V:04-1] C:apereo_cas: Script for resetting webauthn device registration [puppet] - 10https://gerrit.wikimedia.org/r/1303304 (owner: 10Slyngshede) [07:52:40] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8748/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303304 (owner: 10Slyngshede) [07:53:13] (03PS4) 10Abijeet Patro: ULS rewrite: Lock body scroll when open on mobile [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302739 [07:53:52] (03PS1) 10Abijeet Patro: ULS rewrite: Lock scroll too, not just [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303323 [07:53:55] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:54:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303323 (owner: 10Abijeet Patro) [07:54:29] (03PS2) 10Abijeet Patro: ULS rewrite: Sync the fullscreen mobile selector with a URL route [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303297 (https://phabricator.wikimedia.org/T428778) [07:54:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303297 (https://phabricator.wikimedia.org/T428778) (owner: 10Abijeet Patro) [07:57:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host es1044.eqiad.wmnet with OS trixie [07:58:06] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:58:07] (03CR) 10Muehlenhoff: [C:03+2] Apply urldownloader role to urldownloader1005/1006/2006 [puppet] - 10https://gerrit.wikimedia.org/r/1302805 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [07:59:57] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [08:00:06] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2008.codfw.wmnet with reason: host reimage [08:00:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1044: repool after upgrade [08:01:31] !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [08:01:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs1001.eqiad.wmnet with OS bookworm [08:03:06] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:03:32] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [08:04:10] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:04:10] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2007.codfw.wmnet with OS trixie [08:04:23] PROBLEM - statsv Varnishkafka log producer on cp7005 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S 
/etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [08:04:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2008.codfw.wmnet with reason: host reimage [08:04:51] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:05:23] RECOVERY - statsv Varnishkafka log producer on cp7005 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [08:06:07] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:06:51] btullis@cumin1003 reimage (PID 348375) is awaiting input [08:07:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [08:07:43] (03CR) 10Jelto: [V:03+1 C:03+2] Build helm3.19 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) (owner: 10Jelto) [08:08:32] (03CR) 10Marostegui: [C:03+2] Revert "es2045: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1303290 (owner: 10Marostegui) [08:09:27] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1303296|Squashed diff to master]], [[gerrit:1303295|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:12:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:12:20] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300750 (owner: 10PipelineBot) [08:12:21] !log mlitn@deploy1003 mlitn: Continuing with deployment [08:13:08] (03PS2) 10Slyngshede: C:apereo_cas: Script for resetting webauthn device registration [puppet] - 10https://gerrit.wikimedia.org/r/1303304 [08:13:37] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#12028023 (10RKemper) We should be able to get the PoC working again by bumping confluent-kafka to 2.14.2 (latest) and swapping out... [08:14:00] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host conf2009.codfw.wmnet with OS trixie [08:14:26] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300750 (owner: 10PipelineBot) [08:17:48] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [08:19:15] (03PS1) 10Muehlenhoff: admin-ng: Allow the new URL downloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303341 (https://phabricator.wikimedia.org/T427282) [08:21:40] (03PS24) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [08:22:24] (03PS4) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [08:22:56] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" 
[08:23:13] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:23:13] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2008.codfw.wmnet with OS trixie [08:23:28] (03PS1) 10Giuseppe Lavagetto: requestctl_cli: add update-provenance-map command [puppet] - 10https://gerrit.wikimedia.org/r/1303342 [08:23:47] (03PS1) 10Kevin Bazira: ml-services: deploy cope-b-a4b isvc that trims response to violation, p_violation, p_safe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303343 (https://phabricator.wikimedia.org/T427497) [08:25:25] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303296|Squashed diff to master]], [[gerrit:1303295|Squashed diff to master]] (duration: 35m 34s) [08:25:58] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: deploy cope-b-a4b isvc that trims response to violation, p_violation, p_safe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303343 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [08:27:13] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl_cli: add update-provenance-map command [puppet] - 10https://gerrit.wikimedia.org/r/1303342 (owner: 10Giuseppe Lavagetto) [08:29:10] (03PS1) 10Arnaudb: ci: update jenkins build monitor [puppet] - 10https://gerrit.wikimedia.org/r/1303344 (https://phabricator.wikimedia.org/T420865) [08:29:44] (03CR) 10Arnaudb: [C:03+2] ci: update jenkins build monitor [puppet] - 10https://gerrit.wikimedia.org/r/1303344 (https://phabricator.wikimedia.org/T420865) (owner: 10Arnaudb) [08:30:35] (03PS4) 10Arnaudb: gerrit: add 5xx and 4xx alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1301233 (https://phabricator.wikimedia.org/T428979) [08:30:59] (03CR) 10Gmodena: EventStreamConfig: add stream for WDQS V2 external queries. 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [08:31:11] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2009.codfw.wmnet with reason: host reimage [08:31:40] (03CR) 10Gmodena: "+ Andrew and Thomas for an ACK." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [08:32:36] (03CR) 10Brouberol: dse-k8s-services: WDQS deployment helmfile values (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [08:32:48] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#12028089 (10elukey) We have a Kafka upgrade Cloud project that we could use with Pontoon, we could use it to create the PoC in ther... 
[08:33:25] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy cope-b-a4b isvc that trims response to violation, p_violation, p_safe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303343 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [08:34:22] (03PS1) 10Joal: Update turnilo config for banner_activity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303346 (https://phabricator.wikimedia.org/T414478) [08:35:09] (03CR) 10Btullis: [C:03+1] Update turnilo config for banner_activity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303346 (https://phabricator.wikimedia.org/T414478) (owner: 10Joal) [08:35:18] (03PS2) 10Joal: Update turnilo config for banner_activity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303346 (https://phabricator.wikimedia.org/T414478) [08:35:20] (03CR) 10Trueg: dse-k8s-services: WDQS deployment helmfile values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [08:35:33] (03Merged) 10jenkins-bot: ml-services: deploy cope-b-a4b isvc that trims response to violation, p_violation, p_safe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303343 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [08:35:47] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1301883 (https://phabricator.wikimedia.org/T422249) (owner: 10Giuseppe Lavagetto) [08:35:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on conf2009.codfw.wmnet with reason: host reimage [08:36:37] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:37:29] (03CR) 10Brouberol: dse-k8s-services: Enable ingress on WDQS namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [08:37:34] (03CR) 10Btullis: [C:03+2] Update turnilo config for banner_activity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303346 (https://phabricator.wikimedia.org/T414478) (owner: 10Joal) [08:38:08] !log installing apache2 security updates [08:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:36] !log "Imported helm3 3.19.5-1 to bullseye-wikimedia, bookworm-wikimedia and trixie-wikimedia - T427403" [08:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:41] T427403: Update helm to 3.19 - https://phabricator.wikimedia.org/T427403 [08:39:34] (03PS2) 10Giuseppe Lavagetto: fetch_external_clouds_vendors_nets: commit changes to provenance map [puppet] - 10https://gerrit.wikimedia.org/r/1301883 (https://phabricator.wikimedia.org/T422249) [08:39:44] (03Merged) 10jenkins-bot: Update turnilo config for banner_activity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303346 (https://phabricator.wikimedia.org/T414478) (owner: 10Joal) [08:41:38] (03CR) 10Fabfur: [C:03+1] haproxy: use ipblocks map created by hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1299940 (https://phabricator.wikimedia.org/T422249) (owner: 10Giuseppe Lavagetto) [08:41:46] (03PS25) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [08:43:03] (03PS5) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [08:43:44] 10ops-codfw, 06DC-Ops: Power Supply - Status - issue on cirrussearch2080:9290 - https://phabricator.wikimedia.org/T429448 (10phaultfinder) 03NEW 
[08:44:22] (03CR) 10Giuseppe Lavagetto: [C:03+2] fetch_external_clouds_vendors_nets: commit changes to provenance map [puppet] - 10https://gerrit.wikimedia.org/r/1301883 (https://phabricator.wikimedia.org/T422249) (owner: 10Giuseppe Lavagetto) [08:45:07] (03CR) 10Fabfur: [C:03+1] "lgtm, [nit] missing Bug on commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1299941 (owner: 10Giuseppe Lavagetto) [08:45:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1044: repool after upgrade [08:46:04] (03CR) 10Clément Goubert: rest-gateway: emit 401 if rate limit is 0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) (owner: 10Daniel Kinzler) [08:46:20] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s1 T429190 [08:46:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [08:46:24] T429190: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T429190 [08:46:44] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set db2212 with weight 0 T429190', diff saved to https://phabricator.wikimedia.org/P94215 and previous config saved to /var/cache/conftool/dbconfig/20260617-084642-cwilliams.json [08:48:31] (03PS1) 10Marostegui: wmnet: Update es5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1303347 (https://phabricator.wikimedia.org/T428572) [08:48:46] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:49:33] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1302167 (https://phabricator.wikimedia.org/T429190) [08:49:34] (03CR) 10Marostegui: [C:03+2] wmnet: Update es5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1303347 (https://phabricator.wikimedia.org/T428572) (owner: 10Marostegui) [08:49:38] !log marostegui@dns1004 START - running authdns-update [08:50:00] (03CR) 10JMeybohm: [C:03+1] admin-ng: Allow the new URL downloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303341 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [08:51:21] !log marostegui@dns1004 END - running authdns-update [08:51:30] (03CR) 10CWilliams: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1302167 (https://phabricator.wikimedia.org/T429190) (owner: 10Gerrit maintenance bot) [08:51:54] !log Starting s1 codfw failover from db2203 to db2212 - T429190 [08:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:58] T429190: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T429190 [08:52:59] (03CR) 10Elukey: [C:03+1] admin-ng: Allow the new URL downloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303341 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [08:53:11] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Promote db2212 to s1 primary T429190', diff saved to https://phabricator.wikimedia.org/P94217 and previous config saved to /var/cache/conftool/dbconfig/20260617-085310-cwilliams.json [08:55:03] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:55:24] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#12028189 (10elukey) 
All hosts provisioned and reimaged. Please keep in mind that I used test-cookbook to test new changes, they are still in code... [08:55:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:55:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host conf2009.codfw.wmnet with OS trixie [08:56:16] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Depool db2203 T429190', diff saved to https://phabricator.wikimedia.org/P94218 and previous config saved to /var/cache/conftool/dbconfig/20260617-085615-cwilliams.json [08:57:02] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [08:57:14] (03CR) 10Brouberol: dse-k8s-services: WDQS deployment helmfile values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [08:58:21] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [08:58:38] (03PS3) 10Giuseppe Lavagetto: haproxy: use ipblocks map created by hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1299940 (https://phabricator.wikimedia.org/T422249) [08:59:06] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303349 (https://phabricator.wikimedia.org/T425336) [08:59:44] !log joal@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [08:59:59] !log joal@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [09:01:50] !log joal@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [09:02:03] !log joal@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/turnilo: apply [09:02:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es2046 to es5 codfw primary T428572', diff saved to https://phabricator.wikimedia.org/P94219 and previous config saved to /var/cache/conftool/dbconfig/20260617-090221-marostegui.json [09:02:26] T428572: Migrate es5 section to Debian Trixie - https://phabricator.wikimedia.org/T428572 [09:02:28] (03CR) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [09:03:41] (03CR) 10Trueg: dse-k8s-services: WDQS deployment helmfile values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [09:03:44] (03PS1) 10Blake: mcrouter_wancache: Remove 2 gutterpool servers for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303350 (https://phabricator.wikimedia.org/T426044) [09:04:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:05:09] (03PS12) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) [09:05:16] (03CR) 10Aleksandar Mastilovic: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [09:05:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2044: Upgrading es2044.codfw.wmnet [09:05:24] (03PS6) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [09:05:39] (03CR) 10Aleksandar Mastilovic: Presto memory tuning, resource groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) 
[09:05:57] (03CR) 10Aleksandar Mastilovic: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [09:05:58] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool es2044: Upgrading es2044.codfw.wmnet [09:06:08] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [09:06:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2045: repool after maintenance es2045 [09:07:08] 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#12028243 (10elukey) Added ACLs in logging-codfw: ` elukey@kafka-logging2002:~$ bash /etc/kafka/acls.sh kafka-acls --bootstrap-server kafka-logging2001.codfw.wmnet:9092,kafka-logging2002.codfw.wmnet:9092,kafka-logging2003.codfw.... [09:07:45] !log add basic Kafka ACLs for anonymous to logging-codfw - T425528 (I'll add rollback steps in the task if needed) [09:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:49] T425528: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528 [09:08:23] (03CR) 10Aleksandar Mastilovic: Presto memory tuning, resource groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [09:09:26] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:09:37] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2203: Upgrading db2203.codfw.wmnet [09:09:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2203: Upgrading db2203.codfw.wmnet [09:10:47] 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#12028264 (10elukey) To rollback the above if needed: * ssh to any kafka-logging2* * cat /etc/kafka/acls.sh * replace "--add" with "--remove" * Execute the commands using `sudo 
-E` [09:11:33] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2203.codfw.wmnet with OS trixie [09:12:54] (03PS1) 10Jcrespo: Revert^4 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1303352 [09:14:33] (03PS2) 10Jcrespo: Revert^4 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1303352 (https://phabricator.wikimedia.org/T427897) [09:14:41] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303352 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [09:15:06] (03CR) 10Santiago Faci: Add Phabricator specific configuration for Test Kitchen (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) [09:16:46] (03CR) 10Jcrespo: [C:03+2] Revert^4 "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1303352 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [09:17:34] Should I merge mariadb: Promote db2212 to s1 master (27708f85b7) ? [09:17:57] cezmunsta: ^ [09:18:07] (03PS1) 10Dreamy Jazz: Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits [extensions/DiscussionTools] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303354 (https://phabricator.wikimedia.org/T428883) [09:18:23] jouncebot: nowandnext [09:18:23] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [09:18:23] In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1000) [09:18:30] Any problem with me using scap? [09:18:30] (03PS1) 10Blake: test_main: Remove redundant wildcards. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303353 (https://phabricator.wikimedia.org/T428772) [09:21:24] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12028314 (10cmooney) >>! In T425088#12023306, @Jclark-ctr wrote: > Looks like this is failing with the provision script. @cmooney Said he can take a look later to resolve it > > > > ` > clo... [09:21:48] (03CR) 10Gmodena: dse-k8s-services: WDQS deployment helmfile values (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [09:22:15] Going to use scap shortly [09:23:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12028315 (10Jclark-ctr) Thank you! [09:23:36] (03PS1) 10Cathal Mooney: Provision script: improve error message if two prefixes found for vlan [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1303355 (https://phabricator.wikimedia.org/T425088) [09:23:57] (03CR) 10Gmodena: [C:03+1] "LGTM! LGTM! If helmfile works locally, let's merge and iterate on any issues in follow up patches." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [09:25:07] (03CR) 10Trueg: dse-k8s-services: WDQS deployment helmfile values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [09:26:23] (03CR) 10Filippo Giunchedi: [C:03+1] Provision script: improve error message if two prefixes found for vlan [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1303355 (https://phabricator.wikimedia.org/T425088) (owner: 10Cathal Mooney) [09:26:31] !log testing x1 backups @ cumin2003 T427897 [09:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:35] T427897: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897 [09:26:42] (03Abandoned) 10Marostegui: db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303288 (https://phabricator.wikimedia.org/T429118) (owner: 10Marostegui) [09:27:43] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2203.codfw.wmnet with reason: host reimage [09:28:00] (03PS1) 10Dreamy Jazz: hCaptcha: Remove config for VE and DT enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303356 (https://phabricator.wikimedia.org/T428883) [09:28:02] (03PS1) 10Gkyziridis: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) [09:29:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set es6 eqiad as read-only for maintenance - T429436', diff saved to https://phabricator.wikimedia.org/P94222 and previous config saved to /var/cache/conftool/dbconfig/20260617-092913-marostegui.json [09:29:19] T429436: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T429436 [09:29:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs1002.eqiad.wmnet with OS bookworm [09:29:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Primary switchover es6 T429436 [09:29:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303356 (https://phabricator.wikimedia.org/T428883) (owner: 10Dreamy Jazz) [09:29:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303354 (https://phabricator.wikimedia.org/T428883) (owner: 10Dreamy Jazz) [09:29:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set es1037 with weight 0 T429436', diff saved to https://phabricator.wikimedia.org/P94223 and previous config saved to /var/cache/conftool/dbconfig/20260617-092940-marostegui.json [09:29:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-wdqs100... 
[09:31:04] (03Merged) 10jenkins-bot: hCaptcha: Remove config for VE and DT enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303356 (https://phabricator.wikimedia.org/T428883) (owner: 10Dreamy Jazz) [09:31:31] (03Merged) 10jenkins-bot: Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits [extensions/DiscussionTools] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303354 (https://phabricator.wikimedia.org/T428883) (owner: 10Dreamy Jazz) [09:31:59] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1303286 (https://phabricator.wikimedia.org/T429436) (owner: 10Gerrit maintenance bot) [09:32:04] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1303356|hCaptcha: Remove config for VE and DT enable (T428883)]], [[gerrit:1303354|Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits (T428883)]] [09:32:10] T428883: hCaptcha: Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits - https://phabricator.wikimedia.org/T428883 [09:32:17] !log Starting es6 eqiad failover from es1038 to es1037 - T429436 [09:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:46] (03CR) 10Gmodena: [C:03+1] Added DNS entries for the new WDQS 2 deployments in DSE K8s. 
(032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) (owner: 10Trueg) [09:33:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1037 to es6 primary T429436', diff saved to https://phabricator.wikimedia.org/P94224 and previous config saved to /var/cache/conftool/dbconfig/20260617-093310-marostegui.json [09:33:49] (03Abandoned) 10Marostegui: wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1303287 (https://phabricator.wikimedia.org/T429436) (owner: 10Gerrit maintenance bot) [09:34:29] (03PS1) 10Marostegui: wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1303359 (https://phabricator.wikimedia.org/T429436) [09:34:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2203.codfw.wmnet with reason: host reimage [09:34:32] (03PS2) 10Gkyziridis: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) [09:34:48] (03PS1) 10FNegri: aptrepo: move wikireplicas-utils to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1303360 (https://phabricator.wikimedia.org/T351637) [09:35:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12028369 (10Jclark-ctr) I will look again in an hour when I get on site [09:35:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1038 T429436', diff saved to https://phabricator.wikimedia.org/P94225 and previous config saved to /var/cache/conftool/dbconfig/20260617-093513-marostegui.json [09:35:19] T429436: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T429436 [09:35:30] (03CR) 10Marostegui: [C:03+2] wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1303359 (https://phabricator.wikimedia.org/T429436) (owner: 10Marostegui) [09:35:31] (03PS2) 10FNegri: 
aptrepo: move wikireplicas-utils to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1303360 (https://phabricator.wikimedia.org/T351637) [09:35:34] !log marostegui@dns1004 START - running authdns-update [09:36:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set es6 eqiad back to read-write - T429436', diff saved to https://phabricator.wikimedia.org/P94226 and previous config saved to /var/cache/conftool/dbconfig/20260617-093559-marostegui.json [09:36:22] (03CR) 10Filippo Giunchedi: [C:03+1] aptrepo: move wikireplicas-utils to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1303360 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [09:36:55] (03CR) 10FNegri: [C:03+2] aptrepo: move wikireplicas-utils to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1303360 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [09:37:19] !log marostegui@dns1004 END - running authdns-update [09:37:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:37:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1038: Upgrading es1038.eqiad.wmnet [09:38:02] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1303356|hCaptcha: Remove config for VE and DT enable (T428883)]], [[gerrit:1303354|Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits (T428883)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:38:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1038: Upgrading es1038.eqiad.wmnet [09:38:05] T428883: hCaptcha: Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits - https://phabricator.wikimedia.org/T428883 [09:38:53] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1038.eqiad.wmnet with OS trixie [09:39:17] (03PS1) 10FNegri: aptrepo: move wikireplicas-utils to trixie, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1303361 (https://phabricator.wikimedia.org/T351637) [09:39:32] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs1002.eqiad.wmnet with OS bookworm [09:39:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028413 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-wdqs1002.eq... [09:41:04] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [09:45:59] (03CR) 10Clément Goubert: [C:03+1] test_main: Remove redundant wildcards. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303353 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [09:47:35] (03PS1) 10Sergio Gimeno: migrateMentorStatusAway: Return SIMULATED for all dry-run executions [extensions/GrowthExperiments] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303364 (https://phabricator.wikimedia.org/T409170) [09:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:47:37] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303356|hCaptcha: Remove config for VE and DT enable (T428883)]], [[gerrit:1303354|Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits (T428883)]] (duration: 15m 32s) [09:47:42] T428883: hCaptcha: Drop $wgDiscussionToolsHCaptchaRequiredForAllEdits - https://phabricator.wikimedia.org/T428883 [09:47:52] (03PS1) 10Sergio Gimeno: migrateMentorStatusAway: Return SIMULATED for all dry-run executions [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303365 (https://phabricator.wikimedia.org/T409170) [09:48:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303365 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [09:48:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs1002.eqiad.wmnet with OS bookworm [09:48:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303364 
(https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [09:48:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028451 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-wdqs100... [09:50:48] (03PS1) 10Jelto: helm: remove helm311 package and make helm317 default [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) [09:50:53] (03PS1) 10Jelto: helm: install helm317 and helm319 in parallel [puppet] - 10https://gerrit.wikimedia.org/r/1303368 (https://phabricator.wikimedia.org/T341984) [09:51:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2045: repool after maintenance es2045 [09:51:55] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303369 (https://phabricator.wikimedia.org/T429456) [09:52:08] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2203.codfw.wmnet with OS trixie [09:52:15] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [09:52:20] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303368 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [09:52:21] (03CR) 10Marostegui: [C:03+1] Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [09:54:39] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1038.eqiad.wmnet with reason: host reimage [09:55:28] (03PS26) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [09:55:30] (03PS7) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [09:57:53] (03PS3) 10Trueg: Added DNS entries for the new WDQS 2 deployments in DSE K8s. [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) [09:58:00] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [09:58:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1038.eqiad.wmnet with reason: host reimage [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1000) [10:00:11] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs1002.eqiad.wmnet with reason: host reimage [10:01:59] (03CR) 10Blake: [C:03+2] test_main: Remove redundant wildcards. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303353 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [10:02:37] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2203: Migration of db2203.codfw.wmnet completed [10:04:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs1002.eqiad.wmnet with reason: host reimage [10:05:51] (03CR) 10Brouberol: [C:03+1] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303369 (https://phabricator.wikimedia.org/T429456) (owner: 10JavierMonton) [10:06:39] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303369 (https://phabricator.wikimedia.org/T429456) (owner: 10JavierMonton) [10:07:35] FIRING: [4x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:44] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303369 (https://phabricator.wikimedia.org/T429456) (owner: 10JavierMonton) [10:09:38] btullis@cumin1003 reimage (PID 423755) is awaiting input [10:09:57] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:10:28] (03CR) 10Blake: [C:03+2] mediawiki: Use utf-8 for text/plain and text/html. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1301338 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [10:10:37] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:10:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [10:12:36] !log cumin -x 'A:swift-fe' "disable-puppet 'Disabling puppet for ratelimit deploy - cgoubert'" [10:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:20] (03CR) 10Clément Goubert: [C:03+2] tls_terminator: Convert size to kB for rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1302772 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [10:15:27] (03CR) 10Fabfur: cache::haproxy: using intermediate variable for logging x-provenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [10:15:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1038.eqiad.wmnet with OS trixie [10:16:26] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:17:56] !log cumin -x 'A:swift-fe' "enable-puppet 'Disabling puppet for ratelimit deploy - cgoubert'" [10:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:27] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [10:22:24] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [10:22:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1038: Migration of 
es1038.eqiad.wmnet completed [10:23:33] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [10:23:51] (03PS2) 10Clément Goubert: ratelimit-media: Limits in kB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300714 (https://phabricator.wikimedia.org/T414440) [10:25:29] btullis@cumin1003 reimage (PID 420613) is awaiting input [10:25:30] (03PS1) 10Slyngshede: IDP: Bump local version, 7.3.7.2+wmf13u2 [dns] - 10https://gerrit.wikimedia.org/r/1303380 [10:28:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [10:28:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs1002.eqiad.wmnet with OS bookworm [10:28:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-wdqs1002.eq... 
[10:28:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2002.codfw.wmnet with OS bookworm [10:29:49] !log installing git-lfs security updates [10:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:40] (03CR) 10Clément Goubert: "Going to self-merge this because it's transparent right now due to global shadow mode" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300714 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [10:31:42] (03CR) 10Clément Goubert: [C:03+2] ratelimit-media: Limits in kB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300714 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [10:31:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs1003.eqiad.wmnet with OS bookworm [10:32:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028548 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-wdqs100... [10:33:51] (03Merged) 10jenkins-bot: ratelimit-media: Limits in kB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300714 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [10:34:25] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [10:34:33] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [10:34:51] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [10:35:12] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [10:35:18] (03CR) 10Kevin Bazira: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:35:18] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [10:35:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:35:38] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [10:35:40] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2044: Upgrading es2044.codfw.wmnet [10:35:54] (03CR) 10Effie Mouzeli: [C:03+1] mcrouter_wancache: Remove 2 gutterpool servers for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303350 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [10:36:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2044: Upgrading es2044.codfw.wmnet [10:37:06] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2044.codfw.wmnet with OS trixie [10:38:38] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2002.codfw.wmnet with OS bookworm [10:41:24] (03PS3) 10Gkyziridis: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) [10:42:17] (03CR) 10Gkyziridis: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:43:38] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs1003.eqiad.wmnet with reason: host reimage [10:47:00] (03CR) 10Muehlenhoff: [C:03+2] Disable Debian mirror sync [puppet] - 10https://gerrit.wikimedia.org/r/1302838 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:48:08] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2203: Migration of db2203.codfw.wmnet completed [10:48:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:50:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs1003.eqiad.wmnet with reason: host reimage [10:51:39] (03PS1) 10Muehlenhoff: mirrors: Fix SSH key config [puppet] - 10https://gerrit.wikimedia.org/r/1303384 (https://phabricator.wikimedia.org/T416707) [10:53:02] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:53:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2044.codfw.wmnet with reason: host reimage [10:57:47] (03CR) 10Muehlenhoff: [C:03+2] mirrors: Fix SSH key config [puppet] - 10https://gerrit.wikimedia.org/r/1303384 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:58:41] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:59:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2044.codfw.wmnet with reason: host reimage [10:59:30] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:59:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1163: Upgrading db1163.eqiad.wmnet [10:59:56] !log btullis@cumin1003 START - Cookbook sre.hosts.dhcp for host dse-k8s-wdqs2001.codfw.wmnet [11:00:05] mvolz: Your horoscope predicts another Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1100). [11:00:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1163: Upgrading db1163.eqiad.wmnet [11:00:21] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12028654 (10MoritzMuehlenhoff) [11:00:51] (03Merged) 10jenkins-bot: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:01:30] !log The Debian mirror on mirrors.wikimedia.org has been disabled T416707 [11:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:35] T416707: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707 [11:02:39] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:02:55] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:03:00] btullis@cumin1003 dhcp (PID 434358) is awaiting input [11:04:42] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1163.eqiad.wmnet with OS trixie [11:04:58] (03PS1) 10Muehlenhoff: mirrors: Add a link to the announcement to the index page until full decom [puppet] - 10https://gerrit.wikimedia.org/r/1303386 (https://phabricator.wikimedia.org/T416707) [11:08:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1038: Migration of es1038.eqiad.wmnet completed [11:08:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:08:30] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:09:21] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:09:49] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:10:10] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:11:19] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:11:35] btullis@cumin1003 reimage (PID 430737) is awaiting input [11:11:47] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:11:48] (03PS1) 10Clément Goubert: ratelimit: Unify statsd-exporter labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303389 [11:12:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:12:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs1003.eqiad.wmnet with OS bookworm [11:12:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-wdqs1003.eq... [11:12:37] (03CR) 10Muehlenhoff: [C:03+2] mirrors: Add a link to the announcement to the index page until full decom [puppet] - 10https://gerrit.wikimedia.org/r/1303386 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [11:13:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:40] (03PS1) 10Gkyziridis: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1303390 (https://phabricator.wikimedia.org/T425680) [11:15:00] (03PS1) 10Muehlenhoff: mirrors: Remove firewall rule for rsync access [puppet] - 10https://gerrit.wikimedia.org/r/1303391 (https://phabricator.wikimedia.org/T416707) [11:15:33] (03CR) 10CI reject: [V:04-1] mirrors: Remove firewall rule for rsync access [puppet] - 10https://gerrit.wikimedia.org/r/1303391 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [11:16:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2044.codfw.wmnet with OS trixie [11:17:28] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [11:17:35] (03PS2) 10Muehlenhoff: mirrors: Remove firewall rule for rsync access [puppet] - 10https://gerrit.wikimedia.org/r/1303391 (https://phabricator.wikimedia.org/T416707) [11:17:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2044: repool after maintenance es2044 [11:18:04] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:18:07] (03CR) 10CI reject: [V:04-1] mirrors: Remove firewall rule for rsync access [puppet] - 10https://gerrit.wikimedia.org/r/1303391 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [11:18:27] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage [11:19:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1190.eqiad.wmnet with reason: upgrading [11:20:30] (03PS3) 10Muehlenhoff: mirrors: Remove firewall rule for rsync access [puppet] - 10https://gerrit.wikimedia.org/r/1303391 (https://phabricator.wikimedia.org/T416707) [11:20:58] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1303390 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:21:14] !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on db1171.eqiad.wmnet with reason: upgrading [11:22:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host dse-k8s-wdqs2001.codfw.wmnet [11:22:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1172.eqiad.wmnet with reason: upgrading [11:23:14] (03Merged) 10jenkins-bot: ml-services: Deploy Qwen3.6-27B-FP8 model in experimental ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303390 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:23:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage [11:23:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:23:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2002.codfw.wmnet with OS bookworm [11:23:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12028758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btulli... 
[11:25:25] btullis@cumin1003 reimage (PID 442774) is awaiting input [11:25:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028765 (10BTullis) [11:25:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028770 (10BTullis) 05Open→03Resolved [11:26:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1191.eqiad.wmnet with reason: upgrading [11:27:01] (03CR) 10Muehlenhoff: [C:03+2] mirrors: Remove firewall rule for rsync access [puppet] - 10https://gerrit.wikimedia.org/r/1303391 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [11:27:42] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:29:24] (03CR) 10Gkyziridis: [C:03+2] "There was an issue with the "dot" in the naming of the service: `Qwen3.6`. I filed already a hotfix and deploy it: https://gerrit.wikimedi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303357 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [11:30:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/scholarly-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:33:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12028802 (10Jclark-ctr) I did find a different error. I’m reaching back out to Dell to update them. Additionally, I will be running a hardware stress test on it over the next 12 hours. 
`... [11:34:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12028804 (10Marostegui) Thank you John! The host can be rebooted anytime whenever you need. [11:35:55] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs2002.codfw.wmnet with reason: host reimage [11:40:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs2002.codfw.wmnet with reason: host reimage [11:40:21] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1163.eqiad.wmnet with OS trixie [11:41:32] (03PS1) 10Muehlenhoff: Remove Debian mirror code [puppet] - 10https://gerrit.wikimedia.org/r/1303396 (https://phabricator.wikimedia.org/T416707) [11:42:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:42:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1045: Upgrading es1045.eqiad.wmnet [11:43:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1045: Upgrading es1045.eqiad.wmnet [11:44:29] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#12028816 (10MoritzMuehlenhoff) [11:44:40] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1045.eqiad.wmnet with OS trixie [11:47:47] 06SRE, 06Infrastructure-Foundations, 10netops: Create a cookbook to add tagged_vlans to cloudsw ports - https://phabricator.wikimedia.org/T429466 (10cmooney) 03NEW p:05Triage→03Low [11:48:02] (03PS1) 10Cathal Mooney: WIP: create cookbook to configure switch port vlans for cloud hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1303397 (https://phabricator.wikimedia.org/T429466) [11:50:47] (03CR) 10CI reject: [V:04-1] WIP: create cookbook to configure switch port vlans for cloud hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1303397 
(https://phabricator.wikimedia.org/T429466) (owner: 10Cathal Mooney) [11:51:09] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1163: Migration of db1163.eqiad.wmnet completed [11:51:20] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2002.codfw.wmnet with OS bookworm [11:51:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12028863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cu... [11:54:38] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:55:06] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:55:07] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [11:55:35] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:55:59] (03CR) 10Cathal Mooney: [C:03+2] Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) (owner: 10Cathal Mooney) [11:57:39] (03Merged) 10jenkins-bot: Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) (owner: 10Cathal Mooney) [12:00:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1045.eqiad.wmnet with reason: host reimage [12:00:07] (03CR) 10Muehlenhoff: [C:03+2] Remove Debian mirror code [puppet] - 10https://gerrit.wikimedia.org/r/1303396 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [12:00:43] (03PS1) 10Jcrespo: dbbackups: Migrate all backup snapshots from cumin2002 to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1303401 
(https://phabricator.wikimedia.org/T427897) [12:00:49] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:01:39] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303401 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [12:01:53] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [12:02:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2002.codfw.wmnet with OS bookworm [12:02:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [12:02:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12028929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btulli... [12:02:35] FIRING: [4x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12028928 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:03:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2044: repool after maintenance es2044 [12:03:25] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:03:27] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:03:31] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [12:04:33] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1045.eqiad.wmnet with reason: host reimage [12:05:01] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs2002.codfw.wmnet with reason: host reimage [12:05:35] (03PS1) 10Muehlenhoff: mirrors: Disable mirror_age_metrics metric [puppet] - 10https://gerrit.wikimedia.org/r/1303405 (https://phabricator.wikimedia.org/T416707) [12:05:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/scholarly-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:06:09] (03PS1) 10Btullis: Bring dse-k8s-wdqs100[1-3] into service [puppet] - 10https://gerrit.wikimedia.org/r/1303406 (https://phabricator.wikimedia.org/T423314) [12:07:38] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:07:43] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:07:43] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [12:07:46] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:07:47] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:07:50] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:08:53] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:09:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs2002.codfw.wmnet with reason: host reimage [12:12:07] (03PS2) 10Btullis: Bring dse-k8s-wdqs100[1-3] into service [puppet] - 10https://gerrit.wikimedia.org/r/1303406 (https://phabricator.wikimedia.org/T423314) 
[12:13:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:14:36] (03CR) 10Jcrespo: [C:03+2] dbbackups: Migrate all backup snapshots from cumin2002 to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1303401 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [12:15:07] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs2001.codfw.wmnet with reason: host reimage [12:15:22] (03CR) 10Brouberol: "Don't we also need to add a specific node label/taint?" [puppet] - 10https://gerrit.wikimedia.org/r/1303406 (https://phabricator.wikimedia.org/T423314) (owner: 10Btullis) [12:16:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2003.codfw.wmnet with OS bookworm [12:16:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2004.codfw.wmnet with OS bookworm [12:16:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12028948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btulli... [12:16:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12028949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btulli... 
[12:18:20] (03CR) 10Btullis: "I believe that will be set automatically by this: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/dse" [puppet] - 10https://gerrit.wikimedia.org/r/1303406 (https://phabricator.wikimedia.org/T423314) (owner: 10Btullis) [12:18:50] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12028951 (10jcrespo) @MoritzMuehlenhoff The test worked successfully, thus, I've migrated the backups from cumin2002 to cumin2003. The backups are set up, but disabled on the old h... [12:19:07] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:19:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs2001.codfw.wmnet with reason: host reimage [12:20:50] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12028954 (10MoritzMuehlenhoff) >>! In T427897#12028951, @jcrespo wrote: > @MoritzMuehlenhoff The test worked successfully, thus, I've migrated the backups from cumin2002 to cumin2... [12:21:00] FIRING: [8x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs1001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:21:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1045.eqiad.wmnet with OS trixie [12:22:12] btullis@cumin1003 reimage (PID 451844) is awaiting input [12:22:30] (03CR) 10Blake: [C:03+2] mcrouter_wancache: Remove 2 gutterpool servers for maintenance.
[puppet] - 10https://gerrit.wikimedia.org/r/1303350 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [12:22:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:57] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [12:23:30] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog for 1.0.6 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302860 (owner: 10Muehlenhoff) [12:23:47] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Security updates (T426585) - klausman@cumin1003 [12:23:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1045: repool after upgrade [12:24:15] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Security updates (T426585) - klausman@cumin1003 [12:25:51] (03CR) 10Filippo Giunchedi: [C:03+1] aptrepo: move wikireplicas-utils to trixie, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1303361 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [12:28:35] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs2003.codfw.wmnet with reason: host reimage [12:29:03] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs2004.codfw.wmnet with reason: host reimage [12:32:11] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [12:32:27] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [12:32:40] !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [12:32:45] !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [12:33:45] !log btullis@cumin1003 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:33:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs2002.codfw.wmnet with OS bookworm [12:34:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12028993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cu... [12:35:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs2003.codfw.wmnet with reason: host reimage [12:36:18] (03PS1) 10Tiziano Fogli: sloth: add reader-growth task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1303411 (https://phabricator.wikimedia.org/T428617) [12:36:29] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [12:36:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1163: Migration of db1163.eqiad.wmnet completed [12:36:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:37:16] (03CR) 10Fabfur: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302230 (owner: 10BCornwall) [12:37:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:39:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs2004.codfw.wmnet with reason: host reimage [12:40:35] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-gp1005.eqiad.wmnet with OS trixie [12:40:51] (03CR) 10Jelto: [C:03+1] "lgtm 
to me now, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1301233 (https://phabricator.wikimedia.org/T428979) (owner: 10Arnaudb) [12:40:58] btullis@cumin1003 reimage (PID 442774) is awaiting input [12:41:03] !log blake@cumin1003 START - Cookbook sre.hosts.move-vlan for host mc-gp1005 [12:41:28] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-gp1006.eqiad.wmnet with OS trixie [12:41:39] (03CR) 10Arnaudb: [C:03+2] gerrit: add 5xx and 4xx alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1301233 (https://phabricator.wikimedia.org/T428979) (owner: 10Arnaudb) [12:41:41] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Security updates (T426585) - klausman@cumin1003 [12:41:49] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Security updates (T426585) - klausman@cumin1003 [12:41:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:41:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [12:41:56] !log blake@cumin1003 START - Cookbook sre.hosts.move-vlan for host mc-gp1006 [12:43:20] !log blake@cumin1003 START - Cookbook sre.dns.netbox [12:44:01] (03Merged) 10jenkins-bot: gerrit: add 5xx and 4xx alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1301233 (https://phabricator.wikimedia.org/T428979) (owner: 10Arnaudb) [12:44:53] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12029028 (10elukey) @cmooney cloudvirt1078.eqiad.wmnet leads to a NX domain, but I see something provisioned in netbox: https://netbox.wikimedia.org/dcim/devices/6773/inter... 
[12:45:12] (03CR) 10FNegri: [C:03+2] aptrepo: move wikireplicas-utils to trixie, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1303361 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [12:45:29] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-eqiad [12:45:41] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-codfw [12:47:03] (03CR) 10Elukey: [C:03+1] sloth: add reader-growth task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1303411 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [12:48:40] FIRING: [3x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:54] !log blake@cumin1003 START - Cookbook sre.dns.netbox [12:49:23] !log blake@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host mc-gp1005 - blake@cumin1003" [12:49:27] !log blake@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host mc-gp1005 - blake@cumin1003" [12:49:28] !log blake@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:28] !log blake@cumin1003 START - Cookbook sre.dns.wipe-cache mc-gp1005.eqiad.wmnet 126.32.64.10.in-addr.arpa 6.2.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:49:31] !log blake@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mc-gp1005.eqiad.wmnet 126.32.64.10.in-addr.arpa 6.2.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:49:32] !log blake@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mc-gp1005 
[12:51:00] FIRING: [10x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs1001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:51:01] !log blake@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-gp1005 [12:51:01] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host mc-gp1005 [12:51:49] !log blake@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:51:50] !log blake@cumin1003 START - Cookbook sre.dns.wipe-cache mc-gp1006.eqiad.wmnet 182.48.64.10.in-addr.arpa 2.8.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:51:53] !log blake@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mc-gp1006.eqiad.wmnet 182.48.64.10.in-addr.arpa 2.8.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:51:54] !log blake@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mc-gp1006 [12:52:20] !log blake@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-gp1006 [12:52:20] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host mc-gp1006 [12:52:27] (03CR) 10Muehlenhoff: [C:03+2] admin-ng: Allow the new URL downloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303341 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [12:53:31] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:53:40] RESOLVED: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:23] PROBLEM - Host an-worker1231 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:52] 06SRE, 06Infrastructure-Foundations: Migrate diffscan VM to Trixie - https://phabricator.wikimedia.org/T415347#12029053 (10ayounsi) 05Open→03In progress p:05Low→03Medium I created diffscan03 and applied the same puppet config. Let's see if the daily diffscan and weekly peeringdb timers/script work as e... [12:55:43] PROBLEM - Host an-worker1232 is DOWN: PING CRITICAL - Packet loss = 100% [12:55:55] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12029056 (10Marostegui) @tappof so far we've not seen false positives today. I tried generating an alert and it seems to be working: ` [13:20:05] ... [12:56:36] btullis@cumin1003 reimage (PID 457276) is awaiting input [12:58:24] 06SRE, 06DBA, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12029060 (10Marostegui) a:05tappof→03None [12:58:41] 06SRE, 06DBA, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12029066 (10Marostegui) a:03tappof [12:58:53] RECOVERY - Host an-worker1232 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [12:59:37] FIRING: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:56] RECOVERY - Host an-worker1231 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1300). Please do the needful. [13:00:05] yerdua_wmde, danisztls, and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:11] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:00:16] o/ [13:00:39] I can deploy [13:01:03] I can self-deploy [13:01:06] (03CR) 10CDanis: cache::haproxy: using intermediate variable for logging x-provenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [13:02:28] Lucas_WMDE: should I start? [13:02:51] I’m reviewing the first queued config change at the moment [13:03:03] trying to figure out if the necessary code is even deployed [13:03:11] danisztls: I think you can start with your change, yeah [13:03:12] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:03:16] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:03:16] btullis@cumin1003 reimage (PID 457310) is awaiting input [13:04:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302998 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [13:04:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T429474 (10seanleong-WMDE) 03NEW [13:04:37] RESOLVED: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:06:11] FIRING: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:06:21] (03Merged) 10jenkins-bot: Add English Wikipedia Mobile App Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302998 (https://phabricator.wikimedia.org/T428876) (owner: 10DDesouza) [13:06:38] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1005.eqiad.wmnet with reason: host reimage [13:06:47] !log dani@deploy1003 Started scap sync-world: Backport for [[gerrit:1302998|Add English Wikipedia Mobile App Survey (T428876)]] [13:06:52] T428876: Quick survey on Wikipedia - Mobile App Survey (WP25) - https://phabricator.wikimedia.org/T428876 [13:07:43] yerdua_wmde: should we deploy the two config changes separately or together? [13:07:46] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1006.eqiad.wmnet with reason: host reimage [13:08:46] !log dani@deploy1003 dani: Backport for [[gerrit:1302998|Add English Wikipedia Mobile App Survey (T428876)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:09:10] PROBLEM - Host an-master1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:09:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1045: repool after upgrade [13:09:26] Lucas_WMDE I think doing them together is fine [13:09:33] (03CR) 10Papaul: [C:03+2] Add interface irb.900 to security zone mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/1302337 (https://phabricator.wikimedia.org/T421674) (owner: 10Papaul) [13:09:37] FIRING: [9x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:52] FIRING: [11x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:10] unless there's any technical limitations of deployment that I don't know [13:10:17] !log dani@deploy1003 dani: Continuing with deployment [13:10:30] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-eqiad [13:10:37] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1005.eqiad.wmnet with reason: host reimage [13:10:49] yeah, I think that makes more sense too [13:11:08] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-codfw [13:11:11] FIRING: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:11:41] PROBLEM - Host mc-gp1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:12:35] o/ [13:13:33] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on 
deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [13:14:36] Lucas_WMDE, Hello. If we have time, all of the ULS rewrite patches can be merged and backported at once. [13:14:36] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1006.eqiad.wmnet with reason: host reimage [13:14:37] RESOLVED: [8x] ProbeDown: Service ml-cache1002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:41] !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302998|Add English Wikipedia Mobile App Survey (T428876)]] (duration: 07m 53s) [13:14:45] T428876: Quick survey on Wikipedia - Mobile App Survey (WP25) - https://phabricator.wikimedia.org/T428876 [13:15:22] all done [13:15:27] Lucas_WMDE: thanks! [13:15:37] ack [13:15:56] I wanted to try to get a few more benchmarks in before deploying our config changes [13:16:11] FIRING: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:16:37] (03CR) 10Ssingh: "For this specific commit (adding to langlist), no, I don't think Traffic has been involved, or needs to be in that respect.
Since the lang codes" [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [13:16:42] RECOVERY - Host mc-gp1006 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [13:17:00] and then probably do the ULS backports after that [13:17:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302956 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [13:17:54] PROBLEM - Host msw1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:04] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:12] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:13] yerdua_wmde: I have a suspicion that some items might get duplicate WikiProject links… but let’s try that out on mwdebug [13:18:20] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:36] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:36] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:36] PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:36] PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:36] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:36] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:37] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:38] PROBLEM - Host ps1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:38] PROBLEM - Host ps1-f3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:38] PROBLEM - Host ps1-f4-codfw is DOWN: PING 
CRITICAL - Packet loss = 100% [13:18:39] PROBLEM - Host ps1-f5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:40] PROBLEM - Host ps1-f1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:42] hm [13:19:02] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:02] PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:02] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:02] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:02] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:02] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:03] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:03] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:04] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:04] PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:05] PROBLEM - Host ps1-c8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:05] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:06] PROBLEM - Host ps1-e4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:06] PROBLEM - Host ps1-e1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:07] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:08] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:08] PROBLEM - Host ps1-e2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:08] PROBLEM - Host ps1-e5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:09] PROBLEM - Host ps1-c7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:09] PROBLEM - Host ps1-e3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:10] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:10] PROBLEM - Host ps1-d4-codfw is 
DOWN: PING CRITICAL - Packet loss = 100% [13:19:11] PROBLEM - Host ps1-d5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:11] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:12] PROBLEM - Host ps1-d6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:12] PROBLEM - Host ps1-f2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:13] PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:13] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:15] * Lucas_WMDE refrains from giving spiderpig the go-ahead [13:19:24] PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:24] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:19:34] 0o [13:19:51] I can still ssh to bast2003 (codfw) at least… [13:19:55] topranks, papaul: any work in progress? [13:20:02] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [13:20:02] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.51 ms [13:20:02] RECOVERY - Host ps1-d4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [13:20:02] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.52 ms [13:20:02] RECOVERY - Host ps1-d5-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms [13:20:03] RECOVERY - Host ps1-c7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [13:20:03] RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.52 ms [13:20:03] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.28 ms [13:20:04] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.76 ms [13:20:13] on the management router [13:20:13] volans: sorry that was me forgot to log [13:20:25] ok just a mr maintenance? 
[13:20:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:21:11] RESOLVED: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:22:36] is it okay to go ahead with deployment or should I hold? [13:23:22] Lucas_WMDE: ok for me (I'm oncall), as that's only the mgmt network, nothing on the prod one [13:23:30] alright, then I’ll go ahead, thanks [13:23:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [13:23:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936) (owner: 10Sadiya.mohammed13) [13:23:39] (03PS1) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [13:23:44] I just got nervous about the big wall of critical messages ^^ [13:23:44] !log jmm@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [13:24:06] !log jmm@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[13:24:19] (03CR) 10JMeybohm: [C:04-1] helm: remove helm311 package and make helm317 default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:24:27] (03Merged) 10jenkins-bot: Add Wikidata configuration for WikiProject links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [13:24:28] (03PS1) 10Ssingh: images/haproxy: set owner to Traffic [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1303420 [13:24:30] (03CR) 10JMeybohm: [C:03+1] helm: install helm317 and helm319 in parallel [puppet] - 10https://gerrit.wikimedia.org/r/1303368 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:24:32] (03Merged) 10jenkins-bot: Add instance-of WikiProject links for paintings and elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936) (owner: 10Sadiya.mohammed13) [13:24:57] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1298293|Add Wikidata configuration for WikiProject links (T422935)]], [[gerrit:1299943|Add instance-of WikiProject links for paintings and elections (T422936)]] [13:25:02] T422935: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using properties - https://phabricator.wikimedia.org/T422935 [13:25:03] T422936: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using "instance of" statements - https://phabricator.wikimedia.org/T422936 [13:25:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:26:57] !log lucaswerkmeister-wmde@deploy1003 sadiyamohammed13, lucaswerkmeister-wmde: Backport for [[gerrit:1298293|Add Wikidata 
configuration for WikiProject links (T422935)]], [[gerrit:1299943|Add instance-of WikiProject links for paintings and elections (T422936)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:27:06] yerdua_wmde: please test :) [13:27:08] * Lucas_WMDE also tests [13:27:21] taking a look! [13:28:02] yup, two “WikiProject sum of all paintings” links on https://www.wikidata.org/wiki/Q22443226 :S [13:28:07] (found via `haswbstatement:P31=Q3305213 haswbstatement:P195`) [13:28:14] as I feared [13:28:24] yeah, just saw that too [13:28:42] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1005.eqiad.wmnet with OS trixie [13:28:45] okay, then roll back and ask Arian if we should go ahead with just the first config change or wait for both, I think? [13:28:46] (03CR) 10Elukey: [C:03+1] "<3" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1303420 (owner: 10Ssingh) [13:29:12] unless this can be fixed with just a change to the config, idk [13:29:23] would putting the 'propertyIds' and 'statements' in the same block work? [13:29:24] (03CR) 10Ssingh: "@hnowlan@wikimedia.org: Just as an FYI please, since you are the current owner. 
If you prefer to continue maintaining this, please let me " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1303420 (owner: 10Ssingh) [13:29:30] (03CR) 10JMeybohm: "There is a dashboard already at https://grafana-rw.wikimedia.org/d/bf921591-bd2b-4a87-ae20-7cc6f227e58a and search team might be using met" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303389 (owner: 10Clément Goubert) [13:29:39] oh, yeah that *should* work [13:30:06] looking at the code, I think so, yeah [13:30:19] so I’ll first roll back this deploy and revert the two changes [13:30:43] !log cmooney@cumin1003 START - Cookbook sre.network.cloud-host for host cloudvirt1068 [13:30:45] and then deploy for abijeet [13:30:45] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.cloud-host (exit_code=0) for host cloudvirt1068 [13:30:51] and then we’ll see if we have enough time left in the window for another go [13:31:03] !log cmooney@cumin1003 START - Cookbook sre.network.cloud-host for host cloudvirt1069 [13:31:04] !log lucaswerkmeister-wmde@deploy1003 sadiyamohammed13, lucaswerkmeister-wmde: Rolling back deployment [13:31:05] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.cloud-host (exit_code=0) for host cloudvirt1069 [13:31:30] !log cmooney@cumin1003 START - Cookbook sre.network.cloud-host for host cloudvirt1061 [13:31:32] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.cloud-host (exit_code=0) for host cloudvirt1061 [13:31:54] Lucas_WMDE, ready when you are :-) [13:31:56] !log cmooney@cumin1003 START - Cookbook sre.network.cloud-host for host cloudcephosd1016 [13:31:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.cloud-host (exit_code=0) for host cloudcephosd1016 [13:32:02] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Add instance-of WikiProject links for paintings and elections" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303423 [13:32:31] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Add Wikidata configuration for WikiProject 
links" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303424 [13:32:36] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Revert "Add instance-of WikiProject links for paintings and elections" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303423 (owner: 10Lucas Werkmeister (WMDE)) [13:32:40] (03CR) 10CI reject: [V:04-1] Revert "Add Wikidata configuration for WikiProject links" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303424 (owner: 10Lucas Werkmeister (WMDE)) [13:32:48] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Add Wikidata configuration for WikiProject links" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303424 [13:32:49] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1006.eqiad.wmnet with OS trixie [13:32:54] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Revert "Add Wikidata configuration for WikiProject links" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303424 (owner: 10Lucas Werkmeister (WMDE)) [13:33:11] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298293|Add Wikidata configuration for WikiProject links (T422935)]], [[gerrit:1299943|Add instance-of WikiProject links for paintings and elections (T422936)]] (duration: 08m 14s) [13:33:17] T422935: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using properties - https://phabricator.wikimedia.org/T422935 [13:33:17] T422936: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using "instance of" statements - https://phabricator.wikimedia.org/T422936 [13:33:32] (03Merged) 10jenkins-bot: Revert "Add instance-of WikiProject links for paintings and elections" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303423 (owner: 10Lucas Werkmeister (WMDE)) [13:33:49] (03Merged) 10jenkins-bot: Revert "Add Wikidata configuration for WikiProject links" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303424 (owner: 10Lucas Werkmeister (WMDE)) 
[13:34:18] (03CR) 10Brouberol: [C:03+1] "ooh, right" [puppet] - 10https://gerrit.wikimedia.org/r/1303406 (https://phabricator.wikimedia.org/T423314) (owner: 10Btullis) [13:34:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302739 (owner: 10Abijeet Patro) [13:34:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302743 (https://phabricator.wikimedia.org/T416512) (owner: 10Abijeet Patro) [13:34:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303010 (https://phabricator.wikimedia.org/T426532) (owner: 10Abijeet Patro) [13:34:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145) (owner: 10Abijeet Patro) [13:34:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303323 (owner: 10Abijeet Patro) [13:34:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303297 (https://phabricator.wikimedia.org/T428778) (owner: 10Abijeet Patro) [13:34:39] (03CR) 10Huei Tan: [C:03+1] ULS rewrite: Lock scroll too, not just [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303323 (owner: 10Abijeet Patro) [13:34:43] abijeet: 
note that I’m pretty sure some of the relevant tasks won’t get pinged by logmsgbot [13:34:49] topranks: XioNoX: the fire in ops channel was that homer didn't create the vlan-mgmt with the l3-interface irb.900; I don't see how to create it in netbox or homer, any tips? thanks [13:34:57] given that this will be way too long for a single !log message [13:36:17] (03Merged) 10jenkins-bot: ULS rewrite: Lock body scroll when open on mobile [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302739 (owner: 10Abijeet Patro) [13:36:20] papaul: tip number one is downtime first :) [13:36:37] papaul: the issue is ge-0/0/0 is not set to "mode=access" with vlan 900 as the untagged vlan [13:36:54] compare to ge-0/0/1 and ge-0/0/2 in drmrs: https://netbox.wikimedia.org/dcim/devices/3572/interfaces/ [13:37:14] (03PS2) 10Jelto: helm: remove helm311 package and make helm317 default [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) [13:37:17] * Lucas_WMDE was also seriously tempted to refuse deploying the “Co-Authored-By: Claude Opus 4.8” change on principle, ngl [13:37:28] oh good it’s already starting to fail in gate-and-submit [13:37:39] (03CR) 10Jelto: helm: remove helm311 package and make helm317 default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:37:51] (03PS2) 10Jelto: helm: install helm317 and helm319 in parallel [puppet] - 10https://gerrit.wikimedia.org/r/1303368 (https://phabricator.wikimedia.org/T341984) [13:38:00] (03PS2) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [13:38:03] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:38:08] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303368 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:38:16] (03PS2) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:38:56] (due to a variant of T420865, it appears) [13:38:56] T420865: Fetches from Gerrit aborted due to: GnuTLS recv error (-54): Error in the pull function - https://phabricator.wikimedia.org/T420865 [13:39:19] abijeet: do the changes have to be deployed together or would it also be okay to only deploy a subset? [13:40:55] Lucas_WMDE, would be good to deploy them together [13:41:28] ok, then we’ll have to retry some of the gate-and-submits and wait longer for that, I think [13:42:09] !log jmm@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:42:31] (03PS3) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [13:42:47] (03Merged) 10jenkins-bot: ULS rewrite: Fix settings dialog width and field sizing [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302743 (https://phabricator.wikimedia.org/T416512) (owner: 10Abijeet Patro) [13:42:47] (03CR) 10Andrew Bogott: openstack: deprecate icinga check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1302748 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:42:52] !log jmm@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:43:22] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: use ipblocks map created by hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1299940 (https://phabricator.wikimedia.org/T422249) (owner: 10Giuseppe Lavagetto) [13:43:58] at least the success cache should still speed up most of those retried builds, I think [13:44:22] (not necessarily in wall-clock time but at least in CPU time, and also reduce the chance of another failure) [13:44:23] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: deprecate icinga check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1302748 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:44:32] (03Merged) 10jenkins-bot: ULS rewrite: Show variants even when no languages are available [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303010 (https://phabricator.wikimedia.org/T426532) (owner: 10Abijeet Patro) [13:44:35] (03CR) 10CI reject: [V:04-1] ULS rewrite: Capture trigger element before async module load [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145) (owner: 10Abijeet Patro) [13:44:41] (03PS4) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [13:44:52] (03Merged) 10jenkins-bot: ULS rewrite: Lock scroll too, not just [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303323 (owner: 10Abijeet Patro) [13:44:53] (03Merged) 10jenkins-bot: ULS rewrite: Sync the fullscreen mobile selector with a URL route [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303297 (https://phabricator.wikimedia.org/T428778) (owner: 10Abijeet Patro) [13:45:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145) (owner: 10Abijeet Patro) [13:45:52] (03CR) 10Tiziano Fogli: [C:03+2] sloth: add reader-growth task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1303411 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [13:45:55] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-codfw with reason: switch refresh [13:46:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12029369 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0f25bf0f-a791-457b-ad82-68cc6bf09194) set by pt1979@cumin2002 for 1:00:0... 
[13:46:25] oh, they’re not stacked, I didn’t notice [13:46:33] so only one has to retry [13:46:48] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-codfw with reason: mgmt interface change [13:47:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12029377 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=113ab0a6-249c-4a59-a4d0-49f9f85ef5d6) set by pt1979@cumin2002 for 1:00:0... [13:47:08] !log mgmt interface change on mr-codfw [13:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:47:40] (03PS3) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:47:54] (but also, that one then doesn’t benefit from the success cache because the git hashes changed) [13:48:56] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-16-205705 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303433 (https://phabricator.wikimedia.org/T282922) [13:49:16] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) [13:49:54] (03PS4) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:50:18] (03PS1) 10Audrey Penven: Add Wikidata configuration for WikiProject links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303436 (https://phabricator.wikimedia.org/T422935) [13:50:25] !log 
elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest1005.eqiad.wmnet [13:50:56] (03PS2) 10Jforrester: wikifunctions: Switch JavaScript evaluator to Rust-based version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300271 (https://phabricator.wikimedia.org/T417870) [13:50:56] (03PS2) 10Jforrester: wikifunctions: Drop temporary Rust evaluator releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300272 (https://phabricator.wikimedia.org/T417870) [13:51:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest1005.eqiad.wmnet [13:51:39] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [13:52:14] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [13:52:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303436 (https://phabricator.wikimedia.org/T422935) (owner: 10Audrey Penven) [13:52:25] (03Merged) 10jenkins-bot: ULS rewrite: Capture trigger element before async module load [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303009 (https://phabricator.wikimedia.org/T429145) (owner: 10Abijeet Patro) [13:53:14] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1302739|ULS rewrite: Lock body scroll when open on mobile]], [[gerrit:1302743|ULS rewrite: Fix settings dialog width and field sizing (T416512)]], [[gerrit:1303010|ULS rewrite: Show variants even when no languages are available (T426532)]], [[gerrit:1303009|ULS rewrite: Capture trigger element before async module load (T429145)]], [[gerri [13:53:14] t:1303323|ULS rewrite: Lock scroll too, not just ]], [[gerrit:1303297|ULS rewrite: Sync the 
fullscreen mobile selector with a URL route (T428778)]] [13:53:15] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Add Wikidata configuration for WikiProject links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303436 (https://phabricator.wikimedia.org/T422935) (owner: 10Audrey Penven) [13:53:22] T416512: Enable the user to 'pin' preferred languages (for switching language easily) - https://phabricator.wikimedia.org/T416512 [13:53:22] T426532: New Universal­Language­Selector doesn't list language converter variants in MinervaNeue skin - https://phabricator.wikimedia.org/T426532 [13:53:23] T429145: ULS (new version) is displayed in the wrong place on pages with markup - https://phabricator.wikimedia.org/T429145 [13:53:23] T428778: Back button behavior in new mobile language selector - https://phabricator.wikimedia.org/T428778 [13:53:51] yup, T428778 didn’t fit in the SAL message anymore [13:54:19] we might run a bit into the wikifunctions window :/ [13:54:29] (03PS5) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:54:43] 10ops-codfw, 06DC-Ops: Power Supply - Status - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T429484 (10phaultfinder) 03NEW [13:54:44] 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on logstash2036:9290 - https://phabricator.wikimedia.org/T429485 (10phaultfinder) 03NEW [13:54:45] 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T429486 (10phaultfinder) 03NEW [13:55:11] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Backport for [[gerrit:1302739|ULS rewrite: Lock body scroll when open on mobile]], [[gerrit:1302743|ULS rewrite: Fix settings dialog width and field sizing (T416512)]], [[gerrit:1303010|ULS rewrite: Show variants even when no languages are available (T426532)]], [[gerrit:1303009|ULS rewrite: Capture trigger element 
before async module load (T429145)]], [[ge [13:55:12] rrit:1303323|ULS rewrite: Lock scroll too, not just ]], [[gerrit:1303297|ULS rewrite: Sync the fullscreen mobile selector with a URL route (T428778)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:55:15] abijeet: please test [13:55:16] (03PS6) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:55:55] 06SRE, 06Infrastructure-Foundations, 10netops: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488 (10cmooney) 03NEW p:05Triage→03Low [13:56:01] Lucas_WMDE, on it [13:56:18] (03CR) 10Elukey: "Hey folks, I used the old ipmi-password-reset cookbook as baseline to create a new one. The idea is to have something that enforces the cu" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [13:56:29] (03CR) 10Btullis: [C:03+2] Bring dse-k8s-wdqs100[1-3] into service [puppet] - 10https://gerrit.wikimedia.org/r/1303406 (https://phabricator.wikimedia.org/T423314) (owner: 10Btullis) [13:58:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:58:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs2004.codfw.wmnet with OS bookworm [13:58:16] !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:58:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs2003.codfw.wmnet with OS bookworm [13:58:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform 
Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12029490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cu... [13:58:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12029491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cu... [13:59:10] PROBLEM - Host an-druid1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:14] (03PS5) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [13:59:52] (03PS6) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1400) [14:00:14] still deploying, sorry [14:00:32] (03CR) 10Milazg: [C:03+1] REST: Adjust key of Reading Lists OpenAPI spec in RestSandboxSpecs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303004 (https://phabricator.wikimedia.org/T422771) (owner: 10BPirkle) [14:00:56] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [14:01:13] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [14:01:27] (03PS7) 10Blake: mcrouter_wancache: Swap gutterpool servers under maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) [14:02:32] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-16-205705 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303433 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [14:03:25] (03CR) 10Gmodena: dse-k8s-services: WDQS deployment helmfile values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [14:04:39] FIRING: [10x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs1001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:04:47] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-16-205705 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303433 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [14:05:02] FIRING: [10x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs1001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:06:11] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:06:47] abijeet: sorry to poke, but are you still testing? [14:07:32] Lucas_WMDE, yes. I was able to test all but 1 patch: 1303010: ULS rewrite: Show variants even when no languages are available | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+/1303010 [14:08:06] ok [14:08:18] does that mean you’re testing the last one now or should I continue with the deploy? 
[14:08:23] (03CR) 10Effie Mouzeli: [C:03+1] mcrouter_wancache: Swap gutterpool servers under maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [14:08:29] I'm not sure I can test it until 1.47.0-wmf.7 has rolled out to group1 or group2 wikis [14:08:37] Lucas_WMDE, I think we can go ahead with the deploy [14:08:39] (03PS1) 10Marostegui: major-upgrade.py: Add !log dbmaint on the start [cookbooks] - 10https://gerrit.wikimedia.org/r/1303438 [14:08:42] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Continuing with deployment [14:08:44] alright, thanks [14:08:47] (03CR) 10Blake: [C:03+2] mcrouter_wancache: Swap gutterpool servers under maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/1303419 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [14:09:39] RESOLVED: [10x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs1001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:10:02] Lucas_WMDE, thanks. [14:11:38] !log btullis@puppetserver1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs*.eqiad.wmnet [14:11:45] (03CR) 10KineticPelagic: [C:03+1] "Consistency FTW!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303004 (https://phabricator.wikimedia.org/T422771) (owner: 10BPirkle) [14:11:59] !log btullis@puppetserver1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs1001.eqiad.wmnet [14:12:03] !log btullis@puppetserver1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs1002.eqiad.wmnet [14:12:07] !log btullis@puppetserver1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs1003.eqiad.wmnet [14:12:13] !log btullis@puppetserver1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs-test1001.eqiad.wmnet [14:12:27] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs1001.eqiad.wmnet [14:12:30] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs1002.eqiad.wmnet [14:12:34] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs1003.eqiad.wmnet [14:12:43] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-wdqs-test1001.eqiad.wmnet [14:12:48] (03PS1) 10Papaul: change back interface to ge-0/0/0 reboot needed [homer/public] - 10https://gerrit.wikimedia.org/r/1303441 (https://phabricator.wikimedia.org/T421674) [14:12:59] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302739|ULS rewrite: Lock body scroll when open on mobile]], [[gerrit:1302743|ULS rewrite: Fix settings dialog width and field sizing (T416512)]], [[gerrit:1303010|ULS rewrite: Show variants even when no languages are available (T426532)]], [[gerrit:1303009|ULS rewrite: Capture trigger element before async 
module load (T429145)]], [[gerr [14:12:59] it:1303323|ULS rewrite: Lock scroll too, not just ]], [[gerrit:1303297|ULS rewrite: Sync the fullscreen mobile selector with a URL route (T428778)]] (duration: 19m 44s) [14:13:05] T416512: Enable the user to 'pin' preferred languages (for switching language easily) - https://phabricator.wikimedia.org/T416512 [14:13:06] T426532: New Universal­Language­Selector doesn't list language converter variants in MinervaNeue skin - https://phabricator.wikimedia.org/T426532 [14:13:06] T429145: ULS (new version) is displayed in the wrong place on pages with markup - https://phabricator.wikimedia.org/T429145 [14:13:07] T428778: Back button behavior in new mobile language selector - https://phabricator.wikimedia.org/T428778 [14:13:10] !log UTC afternoon backport+config window done [14:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:24] yerdua_wmde: maybe we can retry the config change later today, otherwise tomorrow I guess [14:13:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:13:47] (deployment calendar is free 17:00–19:00 CEST, though I have a 1:1 until 17:30) [14:13:52] sounds good [14:13:55] !log add basic Kafka ACLs for anonymous to logging-eqiad - T425528 [14:13:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2048: Upgrading es2048.codfw.wmnet [14:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:59] T425528: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528 [14:14:29] PROBLEM - Host an-presto1020 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2048: Upgrading es2048.codfw.wmnet [14:14:57] RECOVERY - Host an-presto1020 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [14:14:57] I have to end my day at 18:30 CEST. 
so, if it works to do it before then, I'll be around [14:15:06] otherwise, tomorrow [14:15:57] James_F: maybe you could ping us if you’re done with the Wikifunctions window early? (no problem if not) [14:16:07] 06SRE, 06Infrastructure-Foundations, 10netops: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029587 (10cmooney) [14:16:07] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:16:07] Lucas_WMDE: We're services-only if you want to do MW stuff. [14:16:10] (03CR) 10Papaul: [C:03+2] change back interface to ge-0/0/0 reboot needed [homer/public] - 10https://gerrit.wikimedia.org/r/1303441 (https://phabricator.wikimedia.org/T421674) (owner: 10Papaul) [14:16:17] 06SRE, 06Infrastructure-Foundations, 10netops: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029589 (10cmooney) [14:16:18] ok, then let’s try it now, thanks [14:16:20] (cc yerdua_wmde) [14:16:34] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:16:46] ok, I'm down to try now [14:16:50] 06SRE, 06Infrastructure-Foundations, 10netops: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12029594 (10BTullis) Thanks very much. 
I can see this as being extremely useful for us in #data-platform-sre [14:16:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12029593 (10Papaul) I Changed back the configuration on mr1-codfw for the irb-900 interface since a reboot is needed. I will schedule a maintenance... [14:16:57] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [14:17:00] (03PS2) 10Marostegui: major-upgrade.py: Add !log dbmaint on the start [cookbooks] - 10https://gerrit.wikimedia.org/r/1303438 [14:17:07] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [14:17:07] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:17:14] !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [14:17:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303436 (https://phabricator.wikimedia.org/T422935) (owner: 10Audrey Penven) [14:17:19] !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [14:17:22] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2048.codfw.wmnet with OS trixie [14:17:44] (03PS1) 10Arnaudb: gerrit: fix linting problems [alerts] - 10https://gerrit.wikimedia.org/r/1303444 [14:17:48] (03CR) 10Arnaudb: [C:03+2] gerrit: fix linting problems [alerts] - 10https://gerrit.wikimedia.org/r/1303444 (owner: 10Arnaudb) [14:18:20] (03Merged) 10jenkins-bot: Add Wikidata configuration for WikiProject links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303436 (https://phabricator.wikimedia.org/T422935) (owner: 10Audrey Penven) [14:18:33] 06SRE, 06Infrastructure-Foundations, 10netops: Create cookbook to add BGP peering for host by triggering Homer run 
on correct device - https://phabricator.wikimedia.org/T429488#12029612 (10ayounsi) Yep it's a good idea, but I think we will soon be there ! For decom: https://gerrit.wikimedia.org/r/c/operatio... [14:18:49] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1303436|Add Wikidata configuration for WikiProject links (T422935 T422936)]] [14:18:54] T422935: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using properties - https://phabricator.wikimedia.org/T422935 [14:18:55] T422936: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using "instance of" statements - https://phabricator.wikimedia.org/T422936 [14:19:02] !log cdobbins@cumin1003 conftool action : set/pooled=no; selector: name=dns7002.* [14:19:47] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-gp1004.eqiad.wmnet with OS trixie [14:20:01] (03Merged) 10jenkins-bot: gerrit: fix linting problems [alerts] - 10https://gerrit.wikimedia.org/r/1303444 (owner: 10Arnaudb) [14:20:47] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Backport for [[gerrit:1303436|Add Wikidata configuration for WikiProject links (T422935 T422936)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:21:03] !log depooling dns7002 to attempt reimage to trixie [14:21:03] yerdua_wmde: please test :) [14:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:15] looking [14:21:16] I only see one link on https://www.wikidata.org/wiki/Q22443226, yay [14:21:33] yay [14:21:58] I also only see one on a page where I saw a duplicate [14:22:00] and https://www.wikidata.org/wiki/Q104533829 has one via P31 despite not having a collection statement, so I think that’s also working [14:22:05] so, I think it's good [14:22:13] yeah I think we don’t have to test every property [14:22:14] let’s go :) [14:22:19] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Continuing with deployment [14:24:02] 10SRE-swift-storage, 06Commons, 06DBA, 10media-backups, and 2 others: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12029630 (10Zabe) a:03Zabe Will check [14:24:39] FIRING: [7x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs1001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:25:04] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#12029635 (10Scott_French) Amazing - thank you very much, @elukey! 
(and duly noted about the unmerged cookbook changes) [14:25:51] (03CR) 10Trueg: [C:03+2] dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [14:26:38] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303436|Add Wikidata configuration for WikiProject links (T422935 T422936)]] (duration: 07m 49s) [14:26:44] T422935: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using properties - https://phabricator.wikimedia.org/T422935 [14:26:44] T422936: [WIPR] Connect WikiProjects in the Tools section on relevant Item pages using "instance of" statements - https://phabricator.wikimedia.org/T422936 [14:27:00] * Lucas_WMDE done deploying [14:28:01] (03CR) 10Ottomata: [C:03+1] "One nit, but resolve as you see fit." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [14:28:08] !log cdobbins@cumin1003 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS trixie [14:28:09] (03Merged) 10jenkins-bot: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [14:28:23] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303349 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [14:29:18] (03CR) 10DCausse: [C:03+1] deployment-prep: Update cirrussearch (OpenSearch) config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302956 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [14:29:39] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs2001:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:30:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1400) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1430) [14:31:05] (03PS6) 10Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) [14:31:59] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:16] (03PS7) 10Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) [14:32:16] (03PS1) 10Ayounsi: Add Flow panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/1303451 [14:33:03] (03CR) 10Muehlenhoff: [C:03+2] Update role contacts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1291711 (owner: 10Muehlenhoff) [14:33:09] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2048.codfw.wmnet with reason: host reimage [14:33:10] FIRING: [2x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b4-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:33:46] (03CR) 10JMeybohm: [C:03+1] helm: remove helm311 package and make helm317 default [puppet] - 10https://gerrit.wikimedia.org/r/1303367 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [14:33:48] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1303349 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [14:34:09] (03CR) 10CI reject: [V:04-1] Add phabricator api token for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) [14:35:22] (03PS1) 10Ayounsi: Remove worldmap panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/1303453 [14:35:58] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1004.eqiad.wmnet with reason: host reimage [14:36:01] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303349 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [14:36:17] (03PS1) 10Effie Mouzeli: WIP: Create an llms.txt where honest robots can read our API Policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303454 (https://phabricator.wikimedia.org/T426157) [14:37:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303012 (owner: 10Abijeet Patro) [14:38:53] 06SRE, 10Maps, 06Traffic: Possibility to allow Wikimedia Maps usage on all Wikibase Cloud instances - https://phabricator.wikimedia.org/T429191#12029719 (10ssingh) @MSantos: this needs your approval. 
[14:39:39] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:39] FIRING: [2x] SystemdUnitFailed: cowbuilder_update_bookworm-amd64.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:43] FIRING: [6x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs2001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:39:58] (03CR) 10Ssingh: C:dumps::web::xmldumps block generic user-agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) [14:40:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2048.codfw.wmnet with reason: host reimage [14:44:39] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1004.eqiad.wmnet with reason: host reimage [14:44:50] (03PS1) 10Jforrester: Revert "wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-16-205705" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303456 [14:44:58] (03CR) 10Jforrester: [C:03+2] Revert "wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-16-205705" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303456 (owner: 10Jforrester) [14:46:49] RECOVERY - Host an-druid1005 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [14:46:53] PROBLEM - Druid broker on an-druid1005 is CRITICAL: PROCS CRITICAL: 0 
processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:46:55] PROBLEM - Druid coordinator on an-druid1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:46:55] PROBLEM - Druid historical on an-druid1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:46:55] PROBLEM - Druid overlord on an-druid1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:47:34] (03Merged) 10jenkins-bot: Revert "wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-16-205705" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303456 (owner: 10Jforrester) [14:48:47] (03CR) 10Jforrester: [C:04-2] wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) (owner: 10Jforrester) [14:50:16] (03PS3) 10Giuseppe Lavagetto: haproxy: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/1299941 [14:51:51] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:51:53] RECOVERY - Druid broker on an-druid1005 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:51:55] RECOVERY - Druid coordinator on an-druid1005 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:51:55] RECOVERY 
- Druid overlord on an-druid1005 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:51:55] RECOVERY - Druid historical on an-druid1005 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:52:31] FIRING: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:41] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:54:21] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/1299941 (owner: 10Giuseppe Lavagetto) [14:55:56] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303458 (https://phabricator.wikimedia.org/T421237) [14:56:38] !log cdobbins@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [14:56:49] RECOVERY - Host an-master1004 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [14:57:10] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12029806 (10LSobanski) a:03jhathaway [14:57:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2048.codfw.wmnet with OS trixie [14:57:31] RESOLVED: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:03] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12029810 (10jhathaway) @MoritzMuehlenhoff what do you think of the patch? Or do you want to find a way to retain the colors? [14:59:50] !log cdobbins@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [15:00:57] !log aokoth@deploy1003 Started deploy [phabricator/deployment@a640ed9]: deploy phab [15:01:08] 10SRE-swift-storage, 06Commons: Compressing TIFF files from the Library of Congress - https://phabricator.wikimedia.org/T429264#12029822 (10Yann) OK, but the question is: How much space saved if the files were compressed? [15:02:21] !log aokoth@deploy1003 Finished deploy [phabricator/deployment@a640ed9]: deploy phab (duration: 01m 24s) [15:02:49] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12029827 (10Volans) One quick fix should be to just remove sudo AFAIUI. That said, if in the longer term we still aim to go towards unprivileged cumin that... [15:03:14] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1004.eqiad.wmnet with OS trixie [15:03:48] (03PS1) 10Snwachukwu: Sqoop Mediawiki: Block monthly sqoop jobs on ingestion_wikis success flag. 
[puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) [15:05:28] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:05:57] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:06:32] PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:07:55] cjd91: this is the reimage of dns7002 right? ^^^ [15:08:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2048: Migration of es2048.codfw.wmnet completed [15:09:21] volans: yes [15:10:47] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8749/co" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [15:10:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303004 (https://phabricator.wikimedia.org/T422771) (owner: 10BPirkle) [15:11:02] (03CR) 10Cathal Mooney: [C:03+1] "LGTM if it is acceptable to the obvervability folks. 
Be a great thing to have :)" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/1303451 (owner: 10Ayounsi) [15:11:32] PROBLEM - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:11:33] ack thx [15:12:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [15:15:14] 06SRE, 06Infrastructure-Foundations, 10netops: SR-Linux: applying analytics-in acl to irb sub-interface blocks ARP - https://phabricator.wikimedia.org/T429499 (10cmooney) 03NEW p:05Triage→03High [15:17:14] (03PS1) 10Cathal Mooney: SR-Linux: do not attach ACLs to interfaces for now [homer/public] - 10https://gerrit.wikimedia.org/r/1303465 (https://phabricator.wikimedia.org/T429499) [15:17:49] (03CR) 10BCornwall: [C:03+2] Create sre.cdn.roll-restart-purged [cookbooks] - 10https://gerrit.wikimedia.org/r/1302230 (owner: 10BCornwall) [15:19:08] (03PS3) 10Sbisson: Enable ULS v2 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303012 (owner: 10Abijeet Patro) [15:19:39] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:04] (03CR) 10Ayounsi: [C:03+1] SR-Linux: do not attach ACLs to interfaces for now [homer/public] - 10https://gerrit.wikimedia.org/r/1303465 (https://phabricator.wikimedia.org/T429499) (owner: 10Cathal Mooney) [15:22:31] (03CR) 10Fabfur: cache::haproxy: using intermediate variable for logging x-provenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [15:22:35] (03Merged) 10jenkins-bot: Create sre.cdn.roll-restart-purged [cookbooks] - 10https://gerrit.wikimedia.org/r/1302230 
(owner: 10BCornwall) [15:22:47] (03CR) 10Cathal Mooney: [C:03+2] SR-Linux: do not attach ACLs to interfaces for now [homer/public] - 10https://gerrit.wikimedia.org/r/1303465 (https://phabricator.wikimedia.org/T429499) (owner: 10Cathal Mooney) [15:22:47] (03Abandoned) 10Fabfur: cache::haproxy: using intermediate variable for logging x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [15:24:13] (03Merged) 10jenkins-bot: SR-Linux: do not attach ACLs to interfaces for now [homer/public] - 10https://gerrit.wikimedia.org/r/1303465 (https://phabricator.wikimedia.org/T429499) (owner: 10Cathal Mooney) [15:26:11] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1007.eqiad.wmnet with OS trixie [15:27:45] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:28:14] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12029977 (10cmooney) >>! In T425088#12029028, @elukey wrote: > @cmooney cloudvirt1078.eqiad.wmnet leads to a NX domain, but I see something provisioned in netbox: https://n... [15:29:07] (03PS3) 10AOkoth: hiera: promote phab2003 to passive_server [puppet] - 10https://gerrit.wikimedia.org/r/1302894 (https://phabricator.wikimedia.org/T423727) [15:30:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:32:02] (03PS1) 10Fabfur: cache::haproxy: changing req.provenance to sess.provenance and log [puppet] - 10https://gerrit.wikimedia.org/r/1303473 (https://phabricator.wikimedia.org/T427068) [15:32:46] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12030015 (10jhathaway) >>! In T429129#12029827, @Volans wrote: > One quick fix should be to just remove sudo AFAIUI. > That said, if in the longer term we s... 
[15:35:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303473 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [15:37:31] (03CR) 10Dzahn: [C:03+1] "This should be fine. Though generally I would not tear down existing host before new host is verified to be working. I would expect some m" [puppet] - 10https://gerrit.wikimedia.org/r/1302894 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [15:42:21] (03PS1) 10JMeybohm: Update istio to 1.29.4 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1303475 (https://phabricator.wikimedia.org/T427401) [15:42:26] (03PS1) 10Andrew Bogott: Keystone: Reenable creation of trusts via application-credentials [puppet] - 10https://gerrit.wikimedia.org/r/1303476 [15:42:43] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12030073 (10elukey) Ahhh okok! I see all working now thanks! [15:42:48] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1078.eqiad.wmnet with OS trixie [15:43:05] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12030077 (10Volans) My understanding is that it's set to dumb when not in a PTY: ` mylaptop $ ssh cumin1003.eqiad.wmnet 'echo $TERM' dumb mylaptop $ ssh -t... [15:44:50] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12030078 (10MoritzMuehlenhoff) >>! In T429129#12029810, @jhathaway wrote: > @MoritzMuehlenhoff what do you think of the patch? Or do you want to find a way... 
[15:45:42] (03CR) 10RLazarus: [C:03+2] test_cli: Update assertEquals to assertEqual [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302988 (owner: 10RLazarus) [15:45:47] (03CR) 10RLazarus: [C:03+2] tox: Bump flake8 to 7.3.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302989 (owner: 10RLazarus) [15:45:48] (03CR) 10RLazarus: [C:03+2] tox: Test up to Python 3.14 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302990 (owner: 10RLazarus) [15:45:49] (03CR) 10RLazarus: [C:03+2] builder: Fix type error and unpin mypy version [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302999 (owner: 10RLazarus) [15:45:50] (03CR) 10RLazarus: [C:03+2] Release 4.0.5 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302991 (owner: 10RLazarus) [15:46:15] (03PS1) 10Blake: mcrouter_wancache: Bring mc-gp1004 back into use. [puppet] - 10https://gerrit.wikimedia.org/r/1303469 (https://phabricator.wikimedia.org/T426044) [15:46:25] !log installing python-ldap security updates [15:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:25] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1007.eqiad.wmnet with reason: host reimage [15:48:35] (03Merged) 10jenkins-bot: test_cli: Update assertEquals to assertEqual [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302988 (owner: 10RLazarus) [15:48:36] (03Merged) 10jenkins-bot: tox: Bump flake8 to 7.3.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302989 (owner: 10RLazarus) [15:49:24] (03CR) 10Andrew Bogott: [C:03+2] Keystone: Reenable creation of trusts via application-credentials [puppet] - 10https://gerrit.wikimedia.org/r/1303476 (owner: 10Andrew Bogott) [15:49:51] 06SRE, 06Infrastructure-Foundations, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12030100 (10CDanis) There's a hidden Option 4 here, which is 
to declare that urldownloader would be the first Sophroid-only service, only accessible via t... [15:50:35] (03Merged) 10jenkins-bot: tox: Test up to Python 3.14 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302990 (owner: 10RLazarus) [15:50:36] (03Merged) 10jenkins-bot: builder: Fix type error and unpin mypy version [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302999 (owner: 10RLazarus) [15:50:37] (03Merged) 10jenkins-bot: Release 4.0.5 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302991 (owner: 10RLazarus) [15:51:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12030108 (10MoritzMuehlenhoff) [15:53:40] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1078.eqiad.wmnet with reason: host reimage [15:54:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2048: Migration of es2048.codfw.wmnet completed [15:54:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:54:39] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1007.eqiad.wmnet with reason: host reimage [15:56:43] PROBLEM - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:56:53] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:57:43] RECOVERY - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is OK: OK - Certificate gitlab.wikimedia.org will expire on Tue 01 Sep 2026 09:02:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:57:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1078.eqiad.wmnet with reason: host reimage [15:58:54] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 28690 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:59:17] jelto: o/ is there any maintenance ongoing for gitlab? [15:59:20] I didn't see it in SAL [16:00:08] There was an accidental gitlab upgrade, I'm at the computer in 5 mins :) [16:00:15] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-purged rolling restart_daemons on A:cp and not P{cp7001.magru.wmnet} and A:cp [16:00:24] jelto: ah okok lemme know if you need any help! 
[16:00:26] elukey: sorry, that was me [16:00:26] cc: volans [16:00:33] gitlab should be back up [16:00:46] moritzm: no problem, I just see it passing by and I wanted to make sure it was all ok [16:01:12] pebkac on my end, I only meant to upgrade python-ldap, but then also updated gitlab alongside [16:01:16] (03PS1) 10Volans: cloudnfs: add dumps to language project [puppet] - 10https://gerrit.wikimedia.org/r/1303479 (https://phabricator.wikimedia.org/T429433) [16:01:39] I'm here if needed [16:02:17] volans: seems like everything worked out fine [16:02:27] it seems fine now:) [16:02:28] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5005.eqsin.wmnet} and A:liberica [16:02:43] upgrade to next major version was planned already and done on test hosts [16:02:49] it just happened a bit earlier now, heh [16:02:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:59] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5005.eqsin.wmnet} and A:liberica [16:03:30] (03PS1) 10CDobbins: hieradata: override dns7002; use correct cfg file [puppet] - 10https://gerrit.wikimedia.org/r/1303481 (https://phabricator.wikimedia.org/T401832) [16:04:14] yeah thank you for running the upgrade :D I'll check if GitLab looks good and also upgrade the Runners, so they are on the same version [16:04:16] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12030171 (10VRiley-WMF) Checking now, sorry I was away on vacation [16:04:18] (03CR) 10Ssingh: [C:03+1] hieradata: override dns7002; use correct cfg file [puppet] - 10https://gerrit.wikimedia.org/r/1303481 (https://phabricator.wikimedia.org/T401832) 
(owner: 10CDobbins) [16:04:20] * jelto at the computer now [16:04:31] (03CR) 10Filippo Giunchedi: [C:03+1] cloudnfs: add dumps to language project [puppet] - 10https://gerrit.wikimedia.org/r/1303479 (https://phabricator.wikimedia.org/T429433) (owner: 10Volans) [16:04:39] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:06] (03CR) 10Volans: [C:03+2] cloudnfs: add dumps to language project [puppet] - 10https://gerrit.wikimedia.org/r/1303479 (https://phabricator.wikimedia.org/T429433) (owner: 10Volans) [16:05:35] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-ctrl2006.mgmt:22 - https://phabricator.wikimedia.org/T429283#12030185 (10Jhancock.wm) 05Open→03Resolved [16:05:55] (03CR) 10CDobbins: [C:03+2] hieradata: override dns7002; use correct cfg file [puppet] - 10https://gerrit.wikimedia.org/r/1303481 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [16:08:31] (03CR) 10CDanis: [C:03+1] cache::haproxy: changing req.provenance to sess.provenance and log [puppet] - 10https://gerrit.wikimedia.org/r/1303473 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [16:08:51] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-17-154210 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303482 (https://phabricator.wikimedia.org/T282922) [16:10:00] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) [16:11:49] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [16:12:02] (03CR) 10Jforrester: 
wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) (owner: 10Jforrester) [16:13:39] (03CR) 10Volans: [C:03+2] ceph: allow to set client transport encryption [puppet] - 10https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [16:14:54] elukey@cumin1003 reimage (PID 497715) is awaiting input [16:15:39] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [16:15:39] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1007.eqiad.wmnet with OS trixie [16:15:44] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5005.eqsin.wmnet with OS bookworm [16:16:04] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [16:16:22] !log brett@cumin2002 START - Cookbook sre.hosts.move-vlan for host lvs5005 [16:16:22] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [16:16:22] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1078.eqiad.wmnet with OS trixie [16:18:23] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12030381 (10elukey) Just reimaged cloudvirt1078, I think all hosts are ready now! Please note: the provision/reimage changes to make this happen are not merged yet, I test... 
[16:18:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12030387 (10elukey) Just reimaged kafka-logging1007, I think we are done! Please note: the provision/reimage changes to make this happen are not merged yet, I test-cookboo... [16:19:25] brett@cumin2002 reimage (PID 3939263) is awaiting input [16:19:50] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on logstash2036:9290 - https://phabricator.wikimedia.org/T429485#12030395 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:20:44] (03PS1) 10BCornwall: lvs5005: Set lowest bgp priority during reimage [puppet] - 10https://gerrit.wikimedia.org/r/1303484 (https://phabricator.wikimedia.org/T428229) [16:21:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12030408 (10VRiley-WMF) @Marostegui So, currently, the CPU in DB1224 is an Intel Xeon Gold 5317 (Server is R650xs), I checked the other units and they are all Dell R440's with Intel Gold 5217, which do... 
[16:21:36] (03CR) 10Volans: [C:03+2] Cinder backups: enable transport encryption part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [16:22:02] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8750/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303484 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [16:23:22] (03CR) 10CDobbins: [C:03+1] lvs5005: Set lowest bgp priority during reimage [puppet] - 10https://gerrit.wikimedia.org/r/1303484 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [16:23:55] (03CR) 10BCornwall: [V:03+1 C:03+2] lvs5005: Set lowest bgp priority during reimage [puppet] - 10https://gerrit.wikimedia.org/r/1303484 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [16:26:30] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T429484#12030434 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:27:00] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T429486#12030436 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:27:01] (03PS1) 10BCornwall: common: Update lvs5005's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1303485 (https://phabricator.wikimedia.org/T428229) [16:27:20] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cirrussearch2080:9290 - https://phabricator.wikimedia.org/T429448#12030441 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:29:43] (03CR) 10Scott French: cache::haproxy: changing req.provenance to sess.provenance and log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303473 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [16:31:17] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2247 - 
https://phabricator.wikimedia.org/T429348#12030467 (10Jhancock.wm) Dell kicked it back (surprise) and asked me to try reseating the drive. it caused new errors to pop up (surprise again). resubmitting my request [16:33:36] (03CR) 10CDobbins: [C:03+1] common: Update lvs5005's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1303485 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [16:33:42] (03CR) 10BCornwall: [C:03+2] common: Update lvs5005's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1303485 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [16:34:27] (03PS2) 10Lerickson: EventStreamConfig: add stream for WDQS V2 external queries. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) [16:34:43] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348#12030478 (10Marostegui) That's surprising that they push back on a degraded array [16:36:05] jouncebot: nowandnext [16:36:05] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [16:36:05] In 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1700) [16:36:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12030484 (10Marostegui) >>! In T427535#12030408, @VRiley-WMF wrote: > @Marostegui > > So, currently, the CPU in DB1224 is an Intel Xeon Gold 5317 (Server is R650xs), I checked the other units and they a... [16:36:20] (03CR) 10Lerickson: "Thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [16:37:55] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303458 (https://phabricator.wikimedia.org/T421237) (owner: 10DCausse) [16:38:12] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348#12030485 (10Jhancock.wm) this last year, they push back on a lot of things they didn't use to. [16:38:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:25] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: add stream for WDQS V2 external queries. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [16:38:27] (03PS1) 10Volans: WMCS cinder backups: adjust retention [puppet] - 10https://gerrit.wikimedia.org/r/1303487 (https://phabricator.wikimedia.org/T428867) [16:38:48] (03CR) 10JHathaway: [C:03+2] puppet-merge: disable colors if we don't have a tty [puppet] - 10https://gerrit.wikimedia.org/r/1302262 (https://phabricator.wikimedia.org/T429129) (owner: 10JHathaway) [16:38:49] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12030488 (10jhathaway) >>! In T429129#12030077, @Volans wrote: > My understanding is that it's set to dumb when not in a PTY: > > ` > mylaptop $ ssh cumin1... 
[16:39:07] (03CR) 10CI reject: [V:04-1] WMCS cinder backups: adjust retention [puppet] - 10https://gerrit.wikimedia.org/r/1303487 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [16:39:55] !log brett@cumin2002 START - Cookbook sre.dns.netbox [16:40:12] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303458 (https://phabricator.wikimedia.org/T421237) (owner: 10DCausse) [16:40:49] (03PS2) 10Volans: WMCS cinder backups: adjust retention [puppet] - 10https://gerrit.wikimedia.org/r/1303487 (https://phabricator.wikimedia.org/T428867) [16:40:53] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12030491 (10jhathaway) >> @MoritzMuehlenhoff what do you think of the patch? Or do you want to find a way to retain the colors? > > The patch sounds great,... [16:43:01] (03CR) 10Volans: WMCS cinder backups: adjust retention (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303487 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [16:45:12] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:45:25] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:45:44] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host lvs5005 - brett@cumin2002" [16:45:49] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host lvs5005 - brett@cumin2002" [16:45:50] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:50] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache lvs5005.eqsin.wmnet 6.0.132.10.in-addr.arpa 
6.0.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [16:45:54] !log brett@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) lvs5005.eqsin.wmnet 6.0.132.10.in-addr.arpa 6.0.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [16:47:16] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache lvs5005.eqsin.wmnet 6.0.132.10.in-addr.arpa 6.0.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [16:47:20] !log brett@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) lvs5005.eqsin.wmnet 6.0.132.10.in-addr.arpa 6.0.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [16:47:41] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs5005 [16:47:50] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:48:04] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:48:29] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs5005 [16:48:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host lvs5005 [16:48:40] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:49:08] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:49:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp [16:51:00] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:51:12] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:52:22] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-b7-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - 
https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:52:23] (03CR) 10RLazarus: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1303353 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [16:53:44] (03CR) 10CDanis: [C:03+1] cache::haproxy: changing req.provenance to sess.provenance and log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303473 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [16:58:33] (03CR) 10Dzahn: [C:03+2] "jenkins is stopped and puppet is disabled on contint1003 - merging this and then testing to re-enable puppet and double checking jenkins d" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:58:51] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [16:58:59] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12030528 (10Jclark-ctr) 05Open→03Resolved [16:59:42] (03PS5) 10Hashar: jenkins: ensure service is absent on new Jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:59:51] (03CR) 10Scott French: cache::haproxy: changing req.provenance to sess.provenance and log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303473 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [16:59:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12030531 (10Jclark-ctr) 05Stalled→03Resolved [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1700) [17:00:40] (03CR) 10Dzahn: "oh, the commit is empty - after 
rebase it becomes clear. this was done by taavi in https://gerrit.wikimedia.org/r/c/operations/puppet/+/13" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:01:25] (03CR) 10Dzahn: "thanks. I had a semi-duplicate of this waiting over here because it was in discussion https://gerrit.wikimedia.org/r/c/operations/puppet/+" [puppet] - 10https://gerrit.wikimedia.org/r/1301416 (owner: 10Majavah) [17:01:45] (03Abandoned) 10Dzahn: jenkins: ensure service is absent on new Jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:03:01] !log contint1003 - re-enabling puppet - checking it does NOT start jenkins - also see gerrit:1297236 and gerrit:1301416 - T418521 [17:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:06] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [17:04:51] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [17:04:51] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.30 ms [17:04:59] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:05:50] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1303481 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [17:06:06] !log contint1003 - even with gerrit:1301416 jenkins was STILL restarted :/ - stopping it manually and puppet - debugging - T418521 [17:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:52] (03PS1) 10Pushpaktiwari: T429269: Send logged-in experiment events to ins-502b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303490 [17:07:19] (03CR) 10Dzahn: "the thing is.. 
even with this change jenkins is STILL being restarted :/ and that confuses us all and we need to fix it" [puppet] - 10https://gerrit.wikimedia.org/r/1301416 (owner: 10Majavah) [17:07:23] RESOLVED: CertAlmostExpired: gNMI TLS certificate for lsw1-b7-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:07:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [17:08:59] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:09:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for laurabarluzzi - https://phabricator.wikimedia.org/T429431#12030596 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [17:09:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for laurabarluzzi - https://phabricator.wikimedia.org/T429431#12030606 (10BCornwall) @XenoRyet, can you confirm access as approving manager? @Ottomata / @Ahoelzl / @Milimetric Can you approve this as group approvers? 
[17:10:37] (03CR) 10Dzahn: "even with that change - jenkins was still restarted :/ I went back to manual disable and trying to figure it out for real" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:12:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T429474#12030624 (10BCornwall) [17:12:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T429474#12030630 (10BCornwall) @Suzie-WMDE can you confirm access as approving manager? @Ottomata / @Ahoelzl / @Milimetric Can you approve this as group approvers? [17:13:01] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12030632 (10VRiley-WMF) @Marostegui thank you! [17:13:13] (03CR) 10Sohom Datta: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303490 (owner: 10Pushpaktiwari) [17:13:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T429474#12030638 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [17:13:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [17:14:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12030649 (10MoritzMuehlenhoff) >>! In T429129#12030491, @jhathaway wrote: >>> @MoritzMuehlenhoff what do you think of the patch? Or do you want to find a wa... 
[17:14:37] PROBLEM - Bird Internet Routing Daemon on dns7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:16:57] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [17:20:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [17:21:52] (03PS1) 10CDobbins: dnsrecursor: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) [17:24:46] PROBLEM - NTP peers and stratum check on dns7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [17:25:04] (03PS1) 10Muehlenhoff: Record access for obenhmida [puppet] - 10https://gerrit.wikimedia.org/r/1303499 [17:27:01] (03CR) 10Ssingh: [C:03+1] dnsrecursor: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [17:30:43] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8751/console" [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [17:31:54] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8752/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [17:34:07] 06SRE, 10homer, 06Infrastructure-Foundations, 10netops: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12030797 (10cmooney) 05Open→03Resolved a:03cmooney I'm going to close this one now. The patch to configure interfa... 
[17:36:17] (03PS2) 10CDobbins: dnsrecursor: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) [17:37:13] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8753/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [17:40:28] (03CR) 10Ssingh: [C:03+1] dnsrecursor: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [17:41:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5005.eqsin.wmnet with OS bookworm [17:42:56] brett@cumin2002 roll-restart-ats (PID 3946434) is awaiting input [17:49:39] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:57:41] (03CR) 10Andrew Bogott: [C:03+1] WMCS cinder backups: adjust retention [puppet] - 10https://gerrit.wikimedia.org/r/1303487 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [17:59:37] (03CR) 10Muehlenhoff: [C:03+2] Record access for obenhmida [puppet] - 10https://gerrit.wikimedia.org/r/1303499 (owner: 10Muehlenhoff) [18:00:05] jeena and dduvall: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1800). 
[18:01:17] (03PS1) 10Eric Gardner: Enable beta mobile MMV on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303504 (https://phabricator.wikimedia.org/T426775) [18:07:02] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp6011.* [18:07:25] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6011.drmrs.wmnet [18:07:34] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:08:08] (03PS1) 10Jdlrobson: Donor Delight Badge: Add accessible label and hide popover from AT [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303517 (https://phabricator.wikimedia.org/T427313) [18:14:32] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:15:04] (03CR) 10Volans: [C:03+2] WMCS cinder backups: adjust retention (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1303487 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [18:15:38] (03CR) 10LWatson: [C:03+1] Enable beta mobile MMV on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303504 (https://phabricator.wikimedia.org/T426775) (owner: 10Eric Gardner) [18:16:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6011.drmrs.wmnet [18:17:07] !log commit new lvs5005 IP address to cr2-eqsin.wikimedia.org,cr3-eqsin.wikimedia.org [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:37] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303521 (https://phabricator.wikimedia.org/T423916) [18:19:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by 
jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303521 (https://phabricator.wikimedia.org/T423916) (owner: 10TrainBranchBot) [18:19:43] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp6011.drmrs.wmnet with reason: ats restart, continuing from failed cookbook run [18:21:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [18:22:09] (03PS1) 10Dzahn: jenkins: ensure jenkins service is properly masked and stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303522 [18:23:21] (03PS4) 10Slyngshede: C:dumps::web::xmldumps block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) [18:23:47] (03CR) 10Slyngshede: "Maybe I got it unweirded." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) [18:24:06] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp6011.drmrs.wmnet [18:24:07] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp6011.drmrs.wmnet [18:24:14] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6011.* [18:25:21] (03CR) 10BCornwall: [C:03+1] C:dumps::web::xmldumps block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) [18:26:19] (03CR) 10BCornwall: [C:03+1] Add Kubernetes POD IP reverse range delegations for wikikube-ctrl1005 [dns] - 10https://gerrit.wikimedia.org/r/1302996 (https://phabricator.wikimedia.org/T418920) (owner: 10Jasmine) [18:30:34] (03CR) 10CDobbins: [V:03+1 C:03+2] dnsrecursor: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/1303497 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [18:33:10] FIRING: [2x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b4-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:33:56] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303521 (https://phabricator.wikimedia.org/T423916) (owner: 10TrainBranchBot) [18:35:00] (03PS1) 10Arlolra: Configure $wgTrackPreExpansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303525 (https://phabricator.wikimedia.org/T353697) [18:36:33] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol1008-dev.eqiad.wmnet - https://phabricator.wikimedia.org/T429527 (10BLiviero-WMF) 03NEW [18:38:12] (03PS2) 10Arlolra: Configure $wgTrackPreExpansion [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1303525 (https://phabricator.wikimedia.org/T353697) [18:39:39] FIRING: [6x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs2001:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [18:41:02] (03PS15) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [18:43:56] (03PS1) 10Andrew Bogott: Remove refs to cloudcontrol1008-dev [puppet] - 10https://gerrit.wikimedia.org/r/1303528 (https://phabricator.wikimedia.org/T429527) [18:44:23] (03PS1) 10JHathaway: weak etag comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 [18:46:32] !log andrew@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1008-dev.eqiad.wmnet [18:46:58] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:47:01] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [18:48:33] (03CR) 10Jforrester: [C:03+1] Enable beta mobile MMV on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303504 (https://phabricator.wikimedia.org/T426775) (owner: 10Eric Gardner) [18:49:58] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:51:43] (03CR) 10CI reject: [V:04-1] Remove refs to cloudcontrol1008-dev [puppet] - 10https://gerrit.wikimedia.org/r/1303528 (https://phabricator.wikimedia.org/T429527) (owner: 10Andrew Bogott) 
[18:51:57] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303530 (https://phabricator.wikimedia.org/T423916) [18:52:00] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303530 (https://phabricator.wikimedia.org/T423916) (owner: 10TrainBranchBot) [18:52:16] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [18:52:56] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1303528 (https://phabricator.wikimedia.org/T429527) (owner: 10Andrew Bogott) [18:58:19] andrew@cumin2002 decommission (PID 3973763) is awaiting input [18:58:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [18:59:28] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1008-dev.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [18:59:58] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudcontrol1008-dev [puppet] - 10https://gerrit.wikimedia.org/r/1303528 (https://phabricator.wikimedia.org/T429527) (owner: 10Andrew Bogott) [19:00:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1008-dev.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [19:00:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:05] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1008-dev.eqiad.wmnet [19:00:16] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcontrol1008-dev.eqiad.wmnet - https://phabricator.wikimedia.org/T429527#12031164 
(10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin2002 for hosts: `cloudcontrol1008-dev.eqiad.wmnet` - cloudco... [19:01:35] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcontrol1008-dev.eqiad.wmnet - https://phabricator.wikimedia.org/T429527#12031172 (10Andrew) a:05Andrew→03None [19:01:57] !log jhuneidi@deploy1003 Started scap sync-world: Attempt to roll wmf.7 to group 1 [19:03:29] (03Abandoned) 10Jeena Huneidi: group1 to 1.47.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303530 (https://phabricator.wikimedia.org/T423916) (owner: 10TrainBranchBot) [19:04:32] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [19:05:59] (03CR) 10Gmodena: [C:03+1] EventStreamConfig: add stream for WDQS V2 external queries. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [19:08:28] (03CR) 10BCornwall: IDP: Bump local version, 7.3.7.2+wmf13u2 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1303380 (owner: 10Slyngshede) [19:08:48] !log jhuneidi@deploy1003 Finished scap sync-world: Attempt to roll wmf.7 to group 1 (duration: 07m 24s) [19:10:00] !log jhuneidi@deploy1003 Started scap sync-world: wmf.7 to group 1 (Take 2) [19:11:41] (03CR) 10BCornwall: [C:04-1] varnish: Add CSP report-only header value (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:12:55] (03PS1) 10BCornwall: Revert "lvs5005: Set lowest bgp priority during reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1303533 [19:16:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-purged (exit_code=0) rolling restart_daemons on A:cp and not P{cp7001.magru.wmnet} and A:cp [19:16:31] 
!log jhuneidi@deploy1003 Finished scap sync-world: wmf.7 to group 1 (Take 2) (duration: 07m 01s) [19:17:24] (03PS1) 10CDobbins: dnsrecursor: add quotes to outgoing.dont_query vals [puppet] - 10https://gerrit.wikimedia.org/r/1303534 (https://phabricator.wikimedia.org/T401832) [19:19:12] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8755/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303534 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [19:19:39] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:25:42] (03CR) 10Ssingh: [C:03+1] Revert "lvs5005: Set lowest bgp priority during reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1303533 (owner: 10BCornwall) [19:25:54] (03CR) 10BCornwall: [C:03+2] Revert "lvs5005: Set lowest bgp priority during reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1303533 (owner: 10BCornwall) [19:27:30] jouncebot: nowandnext [19:27:30] For the next 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T1800) [19:27:30] In 0 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T2000) [19:30:27] (03PS2) 10CDobbins: dnsrecursor: add quotes to outgoing.dont_query vals [puppet] - 10https://gerrit.wikimedia.org/r/1303534 (https://phabricator.wikimedia.org/T401832) [19:30:54] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5005*} and A:liberica (T428229) [19:30:59] T428229: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229 [19:31:18] !log brett@cumin2002 END (PASS) - 
Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5005*} and A:liberica (T428229) [19:31:22] (03CR) 10Ssingh: [C:03+1] "Looks good but I suspect we will have more such failures with YAML and :: not being quoted. I guess we will find out." [puppet] - 10https://gerrit.wikimedia.org/r/1303534 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [19:31:30] FIRING: LibericaStaleConfig: Liberica instance lvs5005 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqsin&var-instance=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [19:32:29] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8756/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303534 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [19:33:59] (03CR) 10CDobbins: [V:03+1 C:03+2] dnsrecursor: add quotes to outgoing.dont_query vals [puppet] - 10https://gerrit.wikimedia.org/r/1303534 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [19:34:59] (03PS1) 10Jgreen: Switch frdata-eqiad.wikimedia.org to the new server's public IP [dns] - 10https://gerrit.wikimedia.org/r/1303537 [19:36:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs5005 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqsin&var-instance=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [19:40:33] (03CR) 10Dwisehaupt: [C:03+2] Switch frdata-eqiad.wikimedia.org to the new server's public IP [dns] - 10https://gerrit.wikimedia.org/r/1303537 (owner: 10Jgreen) [19:41:24] (03CR) 10Jgreen: [C:03+2] Switch 
frdata-eqiad.wikimedia.org to the new server's public IP [dns] - 10https://gerrit.wikimedia.org/r/1303537 (owner: 10Jgreen) [19:41:44] (03PS1) 10CDobbins: hieradata: override dns7002; use correct cfg file [puppet] - 10https://gerrit.wikimedia.org/r/1303540 (https://phabricator.wikimedia.org/T401832) [19:42:44] !log jgreen@dns1005 START - running authdns-update [19:42:55] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8757/co" [puppet] - 10https://gerrit.wikimedia.org/r/1303540 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [19:44:35] !log jgreen@dns1005 END - running authdns-update [19:49:33] (03CR) 10BPirkle: [C:03+1] "Seems reasonable based on the previous change to split these urls (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageViewInfo/+/12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [19:50:59] (03CR) 10Gmodena: [C:03+1] EventStreamConfig: add stream for WDQS V2 external queries. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [19:51:32] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12031362 (10MarioProtIV) Looks like it’s been fixed as new links once again work on timelines. At least on English Wikipedia. [19:53:17] (03CR) 10CI reject: [V:04-1] weak etag comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [19:57:19] (03CR) 10BCornwall: [C:03+1] hieradata: override dns7002; use correct cfg file [puppet] - 10https://gerrit.wikimedia.org/r/1303540 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [19:57:26] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. 
Could not store output files" - https://phabricator.wikimedia.org/T428063#12031378 (10SomeRandomDeveloper) 05Open→03Resolved a:03SomeRandomDeveloper [19:57:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol1008-dev.eqiad.wmnet - https://phabricator.wikimedia.org/T429527#12031381 (10Jclark-ctr) a:03Jclark-ctr D5 U38 [19:58:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol1008-dev.eqiad.wmnet - https://phabricator.wikimedia.org/T429527#12031384 (10Jclark-ctr) [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T2000). [20:00:05] Sergi0, abijeet, and bpirkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] o/ [20:00:19] I'm here [20:00:27] o/ I'm here to help with Abijeet's patch [20:01:01] sergi0 can you deploy your own patch? [20:01:20] sure, should I go first? 
[20:01:21] o/ I'm here to test Abijeet's patch [20:01:42] sergi0 go for it, I'll be next [20:02:21] (03PS2) 10BCornwall: admin: Add new SSH key for denisse [puppet] - 10https://gerrit.wikimedia.org/r/1303008 (https://phabricator.wikimedia.org/T429429) (owner: 10Andrea Denisse) [20:02:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303365 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [20:02:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303364 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [20:02:30] (03CR) 10BCornwall: [V:03+2 C:03+2] "Confirmed with video call" [puppet] - 10https://gerrit.wikimedia.org/r/1303008 (https://phabricator.wikimedia.org/T429429) (owner: 10Andrea Denisse) [20:04:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Change SSH key for denisse after new laptop provissioning - https://phabricator.wikimedia.org/T429429#12031411 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [20:04:17] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Change SSH key for denisse after new laptop provissioning - https://phabricator.wikimedia.org/T429429#12031416 (10BCornwall) [20:04:39] FIRING: [2x] SystemdUnitFailed: cowbuilder_update_bookworm-amd64.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:44] My change is a very minor config update with no visible production impact. 
Can it ride with Abijeet's? [20:04:53] (03Merged) 10jenkins-bot: migrateMentorStatusAway: Return SIMULATED for all dry-run executions [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303365 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [20:04:58] (03PS17) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [20:05:11] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [20:05:18] (03PS1) 10Eric Gardner: Image Browsing: fix transparent images in carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303552 (https://phabricator.wikimedia.org/T429047) [20:05:49] (03PS1) 10Eric Gardner: MMV Beta Viewer: Make in-flight image downloads abortable [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303553 (https://phabricator.wikimedia.org/T429193) [20:07:03] sergi0: bpirkle: stephanebisson: hueitan: CI is backlogged :-\ [20:08:48] hashar: acknowledged, thanks [20:09:11] I think it will process the patch you +2'd, because e.g. operations/mediawiki-config and backports to wmf/* branches have higher precedence [20:10:26] (03CR) 10CDobbins: [V:03+1 C:03+2] hieradata: override dns7002; use correct cfg file [puppet] - 10https://gerrit.wikimedia.org/r/1303540 (https://phabricator.wikimedia.org/T401832) (owner: 10CDobbins) [20:10:28] (03Merged) 10jenkins-bot: migrateMentorStatusAway: Return SIMULATED for all dry-run executions [extensions/GrowthExperiments] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303364 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [20:10:46] now [20:10:50] and the backlog visible at https://integration.wikimedia.org/zuul/ seems to be entirely in low precedence pipelines (patch-performance, 
codehealth, coverage) [20:10:58] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1303365|migrateMentorStatusAway: Return SIMULATED for all dry-run executions (T409170)]], [[gerrit:1303364|migrateMentorStatusAway: Return SIMULATED for all dry-run executions (T409170)]] [20:11:00] ah yeah it got a wmf/* patch merged 🎉 [20:11:00] (03PS3) 10Lerickson: EventStreamConfig: add stream for WDQS V2 external/internal queries. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) [20:11:02] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [20:12:36] (03CR) 10Lerickson: EventStreamConfig: add stream for WDQS V2 external/internal queries. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [20:12:57] !log sgimeno@deploy1003 sgimeno: Backport for [[gerrit:1303365|migrateMentorStatusAway: Return SIMULATED for all dry-run executions (T409170)]], [[gerrit:1303364|migrateMentorStatusAway: Return SIMULATED for all dry-run executions (T409170)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:13:35] !log sgimeno@deploy1003 sgimeno: Continuing with deployment [20:14:15] (03PS1) 10RLazarus: Rebuild for Trixie [software/httpbb] - 10https://gerrit.wikimedia.org/r/1303557 (https://phabricator.wikimedia.org/T427899) [20:14:53] (03PS1) 10Andrew Bogott: cloud-vps vendordata: quote 'from' host when adding cumin pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1303558 (https://phabricator.wikimedia.org/T422801) [20:15:29] RECOVERY - Recursive DNS on 195.200.68.37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:15:29] RECOVERY - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:16:14] (03PS2) 10RLazarus: Rebuild for Trixie [software/httpbb] - 10https://gerrit.wikimedia.org/r/1303557 (https://phabricator.wikimedia.org/T427899) [20:16:17] (03CR) 10ArielGlenn: [C:03+1] rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 (owner: 10Daniel Kinzler) [20:17:45] Almost there [20:17:53] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303365|migrateMentorStatusAway: Return SIMULATED for all dry-run executions (T409170)]], [[gerrit:1303364|migrateMentorStatusAway: Return SIMULATED for all dry-run executions (T409170)]] (duration: 06m 55s) [20:17:58] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [20:18:12] @stephanebisson all yours [20:18:17] sergi0 TY [20:18:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303012 (owner: 10Abijeet Patro) [20:19:39] RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:20:31] hashar my 
config patch is queued, any way to know where it sits in the queue and what its ETA could be? [20:20:56] (03PS3) 10Eric Gardner: MMV Beta Viewer: Delay the loading indicator on quick navigation [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1303554 (https://phabricator.wikimedia.org/T429193) [20:23:24] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1303558 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:23:50] Hey all, just a heads up that I will be deploying some patches during the Readers deployment window that starts in about 2 hours [20:26:05] (03PS2) 10BCornwall: admin: Add new SSH key for denisse [puppet] - 10https://gerrit.wikimedia.org/r/1303556 (https://phabricator.wikimedia.org/T429429) (owner: 10Andrea Denisse) [20:26:06] (03CR) 10BCornwall: [C:03+2] admin: Add new SSH key for denisse [puppet] - 10https://gerrit.wikimedia.org/r/1303556 (https://phabricator.wikimedia.org/T429429) (owner: 10Andrea Denisse) [20:26:26] (03CR) 10BCornwall: [V:03+2 C:03+2] "Verified with video call" [puppet] - 10https://gerrit.wikimedia.org/r/1303556 (https://phabricator.wikimedia.org/T429429) (owner: 10Andrea Denisse) [20:27:18] (03Merged) 10jenkins-bot: Enable ULS v2 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303012 (owner: 10Abijeet Patro) [20:27:35] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vendordata: quote 'from' host when adding cumin pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1303558 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:27:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [20:27:44] !log sbisson@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1303012|Enable ULS v2 on group1 wikis]] [20:28:29] stephanebisson: not really but it eventually merged [20:29:44] !log sbisson@deploy1003 sbisson, abi: Backport for [[gerrit:1303012|Enable ULS v2 on group1 wikis]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:29:54] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 06Release-Engineering-Team (Radar), 07User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12031521 (10dancy) [20:30:19] I am off ! [20:31:00] hashar: indeed, thanks [20:31:06] hueitan can you test? [20:31:09] :-] [20:31:13] happy deployment! [20:31:39] stephanebisson done testing, works perfectly [20:31:45] hueitan thanks [20:31:53] !log sbisson@deploy1003 sbisson, abi: Continuing with deployment [20:35:11] (03CR) 10Gmodena: [C:03+1] EventStreamConfig: add stream for WDQS V2 external/internal queries. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [20:35:12] (03CR) 10Lerickson: EventStreamConfig: add stream for WDQS V2 external/internal queries. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) (owner: 10Lerickson) [20:36:07] (03PS1) 10JHathaway: Change find_account to find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) [20:36:10] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303012|Enable ULS v2 on group1 wikis]] (duration: 08m 26s) [20:36:33] over to you bpirkle [20:37:32] Hrm, having to deal with something unrelated here. I'll reschedule mine. 
[20:37:59] (03CR) 10JHathaway: sre.hosts.provision: introduce the wmfroot user (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [20:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:42] (03CR) 10JHathaway: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [20:40:30] (03PS1) 10RLazarus: Updating docker-pkg to 4.0.5 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1303560 [20:41:12] !log cdobbins@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7002.wikimedia.org with OS trixie [20:41:44] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp [20:45:12] !log cdobbins@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns7002.wikimedia.org with reason: bird.service keeps failing [20:45:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2255:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2255 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T2100) [21:00:30] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-17-184727 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303482 (https://phabricator.wikimedia.org/T282922) [21:00:30] (03PS3) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) [21:00:30] (03PS3) 10Jforrester: wikifunctions: Switch JavaScript evaluator to Rust-based version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300271 (https://phabricator.wikimedia.org/T417870) [21:00:31] (03PS3) 10Jforrester: wikifunctions: Drop temporary Rust evaluator releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300272 (https://phabricator.wikimedia.org/T417870) [21:00:35] (03PS1) 10Andrew Bogott: cloud-vps: add ssh access to magnum workers in paws and zuul projects [puppet] - 10https://gerrit.wikimedia.org/r/1303564 (https://phabricator.wikimedia.org/T422801) [21:00:49] (03PS2) 10Andrew Bogott: cloud-vps: add ssh access to magnum workers in paws and zuul projects [puppet] - 10https://gerrit.wikimedia.org/r/1303564 (https://phabricator.wikimedia.org/T422801) [21:01:50] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-17-184727 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303482 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [21:02:02] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [21:02:31] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps: add ssh access to magnum workers in paws and zuul projects [puppet] - 10https://gerrit.wikimedia.org/r/1303564 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [21:02:35] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [21:02:56] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12031657 (10cmooney) Juniper have come back to say this is known bug, somewhat expected I guess. ` After decoding the coredump, it was confirmed... 
[21:04:22] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-06-09-174730 to 2026-06-17-184727 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303482 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [21:05:19] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:06:09] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:09:36] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:11:23] 10SRE-swift-storage, 06Commons, 06DBA, 10media-backups, and 2 others: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12031693 (10Zabe) I can confirm it is related to the file migration schema. I set mw-experimental to rea... [21:11:52] 10SRE-swift-storage, 06Commons, 06DBA, 10media-backups, and 2 others: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12031695 (10Zabe) p:05Triage→03High [21:12:10] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:12:21] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:12:54] (03CR) 10CI reject: [V:04-1] Change find_account to find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) (owner: 10JHathaway) [21:15:25] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:16:08] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:16:38] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) (owner: 
10Jforrester) [21:19:27] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-06-11-171152 to 2026-06-16-183209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303434 (https://phabricator.wikimedia.org/T426336) (owner: 10Jforrester) [21:19:42] (03PS4) 10Jforrester: wikifunctions: Switch JavaScript evaluator to Rust-based version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300271 (https://phabricator.wikimedia.org/T417870) [21:19:42] (03PS4) 10Jforrester: wikifunctions: Drop temporary Rust evaluator releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300272 (https://phabricator.wikimedia.org/T417870) [21:19:42] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-16-183209 to 2026-06-17-182805 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303567 (https://phabricator.wikimedia.org/T427644) [21:20:23] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:20:46] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:21:18] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:22:08] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:22:17] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:23:06] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:23:46] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-06-16-183209 to 2026-06-17-182805 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303567 (https://phabricator.wikimedia.org/T427644) (owner: 10Jforrester) [21:25:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2255:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2255 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:26:14] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-06-16-183209 to 2026-06-17-182805 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303567 (https://phabricator.wikimedia.org/T427644) (owner: 10Jforrester) [21:27:18] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:27:51] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:28:22] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:29:09] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:29:28] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:29:59] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:38:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2255:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2255 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:39:37] (03PS1) 10Eric Gardner: Image Browsing: fix transparent images in carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303571 (https://phabricator.wikimedia.org/T429047) [21:40:35] (03PS1) 10Eric Gardner: MMV Beta Viewer: Make in-flight image downloads abortable [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303572 (https://phabricator.wikimedia.org/T429193) [21:40:38] (03PS1) 10Eric Gardner: MMV Beta Viewer: Delay the loading indicator on quick navigation [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303573 (https://phabricator.wikimedia.org/T429193) [21:44:14] (03CR) 10Muehlenhoff: Rebuild for Trixie (031 comment) [software/httpbb] - 
10https://gerrit.wikimedia.org/r/1303557 (https://phabricator.wikimedia.org/T427899) (owner: 10RLazarus) [21:45:32] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.4.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303575 [21:47:25] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.4.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303575 (owner: 10Santiago Faci) [21:49:37] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.4.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303575 (owner: 10Santiago Faci) [21:49:39] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:52:04] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:52:27] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:58:07] (03PS5) 10Cathal Mooney: Cookbook to configure switch port vlans for cloud hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1303397 (https://phabricator.wikimedia.org/T429466) [21:58:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2255:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2255 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260617T2200) [22:02:10] I will deploy 3 sets of patches shortly, but I think Jdlrobson wanted to get a quick one in first [22:02:29] (03Abandoned) 10Dzahn: jenkins: ensure jenkins service is properly masked and stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303522 (owner: 10Dzahn) 
[22:03:32] EricGardner: thanks [22:04:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303517 (https://phabricator.wikimedia.org/T427313) (owner: 10Jdlrobson) [22:12:16] (03PS1) 10Dzahn: jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303578 [22:12:53] (03CR) 10CI reject: [V:04-1] jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303578 (owner: 10Dzahn) [22:13:21] (03PS2) 10Dzahn: jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303578 [22:13:37] (03Merged) 10jenkins-bot: Donor Delight Badge: Add accessible label and hide popover from AT [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1303517 (https://phabricator.wikimedia.org/T427313) (owner: 10Jdlrobson) [22:14:03] (03CR) 10CI reject: [V:04-1] jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303578 (owner: 10Dzahn) [22:14:08] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1303517|Donor Delight Badge: Add accessible label and hide popover from AT (T427313)]] [22:14:12] T427313: Donor badge experiment: Final design review and adjustments for donor badge - https://phabricator.wikimedia.org/T427313 [22:19:16] (03PS3) 10Dzahn: jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303578 [22:20:09] (03CR) 10CI reject: [V:04-1] jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 10https://gerrit.wikimedia.org/r/1303578 (owner: 10Dzahn) [22:22:30] (03PS4) 10Dzahn: jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - 
https://gerrit.wikimedia.org/r/1303578
[22:23:34] (CR) CI reject: [V:-1] jenkins: use systemd::mask to mask and ensure service is ALSO stopped [puppet] - https://gerrit.wikimedia.org/r/1303578 (owner: Dzahn)
[22:27:18] (PS6) Cathal Mooney: Cookbook to configure switch port vlans for cloud hosts [cookbooks] - https://gerrit.wikimedia.org/r/1303397 (https://phabricator.wikimedia.org/T429466)
[22:31:50] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1303517|Donor Delight Badge: Add accessible label and hide popover from AT (T427313)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:31:55] T427313: Donor badge experiment: Final design review and adjustments for donor badge - https://phabricator.wikimedia.org/T427313
[22:32:47] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[22:33:25] FIRING: [2x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b4-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:33:34] (PS5) Dzahn: jenkins: ensure if service is masked it is also stopped [puppet] - https://gerrit.wikimedia.org/r/1303578
[22:34:29] (CR) CI reject: [V:-1] jenkins: ensure if service is masked it is also stopped [puppet] - https://gerrit.wikimedia.org/r/1303578 (owner: Dzahn)
[22:39:39] FIRING: [6x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-wdqs2001:0 has a BGP session which is not in the 'established' state.
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[22:41:03] (PS6) Dzahn: jenkins: ensure if service is masked it is also stopped [puppet] - https://gerrit.wikimedia.org/r/1303578
[22:45:09] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303517|Donor Delight Badge: Add accessible label and hide popover from AT (T427313)]] (duration: 31m 01s)
[22:45:14] T427313: Donor badge experiment: Final design review and adjustments for donor badge - https://phabricator.wikimedia.org/T427313
[22:45:21] ok, beginning first of 3 sets of patches
[22:45:35] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303571 (https://phabricator.wikimedia.org/T429047) (owner: Eric Gardner)
[22:45:36] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303572 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[22:45:36] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303573 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[22:49:08] (CR) RLazarus: Rebuild for Trixie (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/1303557 (https://phabricator.wikimedia.org/T427899) (owner: RLazarus)
[22:49:21] (PS7) Dzahn: jenkins: ensure if service is masked it is also stopped [puppet] - https://gerrit.wikimedia.org/r/1303578 (https://phabricator.wikimedia.org/T418521)
[22:50:42] (CR) Dzahn: "this fixes it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1303578" [puppet] -
https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[22:50:47] (CR) Dzahn: "this fixes it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1303578" [puppet] - https://gerrit.wikimedia.org/r/1303522 (owner: Dzahn)
[22:51:13] (Merged) jenkins-bot: Image Browsing: fix transparent images in carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303571 (https://phabricator.wikimedia.org/T429047) (owner: Eric Gardner)
[22:51:29] (Merged) jenkins-bot: MMV Beta Viewer: Make in-flight image downloads abortable [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303572 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[22:51:32] (Merged) jenkins-bot: MMV Beta Viewer: Delay the loading indicator on quick navigation [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1303573 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[22:52:05] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1303571|Image Browsing: fix transparent images in carousel (T429047)]], [[gerrit:1303572|MMV Beta Viewer: Make in-flight image downloads abortable (T429193)]], [[gerrit:1303573|MMV Beta Viewer: Delay the loading indicator on quick navigation (T429193)]]
[22:52:11] T429047: [Image Browsing] transparent background images - https://phabricator.wikimedia.org/T429047
[22:52:12] T429193: Pagination lags or skips because of large images - https://phabricator.wikimedia.org/T429193
[22:52:14] (CR) Dzahn: [C:+2] "confirming noop on contint1002 - then re-enabling puppet on contint1003 and confirming it stays stopped" [puppet] - https://gerrit.wikimedia.org/r/1303578 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[22:53:49] andrewbogott: m-m-m-m-multi merge. go ahead?
[22:54:08] oh! That explains why that patch didn't do anything
[22:54:09] yes please!
[22:54:26] ok:) sync in progress
[22:54:54] (CR) Thcipriani: [C:-1] "Something is reaping subprocesses out from under os.wait in this implementation. In testing (by replacing the database update call with a " [puppet] - https://gerrit.wikimedia.org/r/1302910 (owner: Ahmon Dancy)
[22:55:09] andrewbogott: should do something now
[22:55:23] thanks!
[22:56:08] !log egardner@deploy1003 egardner: Backport for [[gerrit:1303571|Image Browsing: fix transparent images in carousel (T429047)]], [[gerrit:1303572|MMV Beta Viewer: Make in-flight image downloads abortable (T429193)]], [[gerrit:1303573|MMV Beta Viewer: Delay the loading indicator on quick navigation (T429193)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:57:58] !log egardner@deploy1003 egardner: Continuing with deployment
[23:03:10] It works!
[23:04:36] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303571|Image Browsing: fix transparent images in carousel (T429047)]], [[gerrit:1303572|MMV Beta Viewer: Make in-flight image downloads abortable (T429193)]], [[gerrit:1303573|MMV Beta Viewer: Delay the loading indicator on quick navigation (T429193)]] (duration: 12m 31s)
[23:04:42] T429047: [Image Browsing] transparent background images - https://phabricator.wikimedia.org/T429047
[23:04:43] T429193: Pagination lags or skips because of large images - https://phabricator.wikimedia.org/T429193
[23:04:46] Ok, on to the second set of 3
[23:05:19] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303552 (https://phabricator.wikimedia.org/T429047) (owner: Eric Gardner)
[23:05:20] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303553
(https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[23:05:20] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303554 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[23:07:15] (CR) Ahmon Dancy: "Ah yes. I'll find a different way to do this (perhaps with a context manager and a worker pool) which keeps resource boundaries clear." [puppet] - https://gerrit.wikimedia.org/r/1302910 (owner: Ahmon Dancy)
[23:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 5.475% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:08:38] (Merged) jenkins-bot: Image Browsing: fix transparent images in carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303552 (https://phabricator.wikimedia.org/T429047) (owner: Eric Gardner)
[23:09:30] (Merged) jenkins-bot: MMV Beta Viewer: Make in-flight image downloads abortable [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303553 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[23:09:33] (Merged) jenkins-bot: MMV Beta Viewer: Delay the loading indicator on quick navigation [extensions/MultimediaViewer] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1303554 (https://phabricator.wikimedia.org/T429193) (owner: Eric Gardner)
[23:10:05] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1303552|Image Browsing: fix transparent images in carousel (T429047)]], [[gerrit:1303553|MMV Beta Viewer: Make in-flight image downloads abortable (T429193)]],
[[gerrit:1303554|MMV Beta Viewer: Delay the loading indicator on quick navigation (T429193)]]
[23:10:11] T429047: [Image Browsing] transparent background images - https://phabricator.wikimedia.org/T429047
[23:10:11] T429193: Pagination lags or skips because of large images - https://phabricator.wikimedia.org/T429193
[23:12:01] !log egardner@deploy1003 egardner: Backport for [[gerrit:1303552|Image Browsing: fix transparent images in carousel (T429047)]], [[gerrit:1303553|MMV Beta Viewer: Make in-flight image downloads abortable (T429193)]], [[gerrit:1303554|MMV Beta Viewer: Delay the loading indicator on quick navigation (T429193)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:12:38] !log egardner@deploy1003 egardner: Continuing with deployment
[23:14:12] !log gerrit2002 - unlink /srv/gerrit/site_path/review_site/logs/logs (T425667)
[23:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:16] T425667: Investigate Gerrit root disk usage and logging - https://phabricator.wikimedia.org/T425667
[23:16:30] PROBLEM - Check unit status of security_group_ssh-from-restricted-bastion_to_project_zuul on cloudcontrol1006 is CRITICAL: CRITICAL: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_zuul https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:17:00] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303552|Image Browsing: fix transparent images in carousel (T429047)]], [[gerrit:1303553|MMV Beta Viewer: Make in-flight image downloads abortable (T429193)]], [[gerrit:1303554|MMV Beta Viewer: Delay the loading indicator on quick navigation (T429193)]] (duration: 06m 55s)
[23:17:07] T429047: [Image Browsing] transparent background images - https://phabricator.wikimedia.org/T429047
[23:17:07] T429193: Pagination lags or skips because of large images - https://phabricator.wikimedia.org/T429193
[23:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:17:33] Ok, about to start final deploy (config patch)
[23:17:52] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1303504 (https://phabricator.wikimedia.org/T426775) (owner: Eric Gardner)
[23:19:20] (Merged) jenkins-bot: Enable beta mobile MMV on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1303504 (https://phabricator.wikimedia.org/T426775) (owner: Eric Gardner)
[23:19:48] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1303504|Enable beta mobile MMV on Wikipedias (T426775)]]
[23:19:52] T426775: [Image Browsing] Launch mobile MMV separate from the carousel - https://phabricator.wikimedia.org/T426775
[23:21:45] !log egardner@deploy1003 egardner: Backport for [[gerrit:1303504|Enable beta mobile MMV on Wikipedias (T426775)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:22:19] !log egardner@deploy1003 egardner: Continuing with deployment
[23:26:35] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303504|Enable beta mobile MMV on Wikipedias (T426775)]] (duration: 06m 46s)
[23:26:40] T426775: [Image Browsing] Launch mobile MMV separate from the carousel - https://phabricator.wikimedia.org/T426775
[23:36:30] RECOVERY - Check unit status of security_group_ssh-from-restricted-bastion_to_project_zuul on cloudcontrol1006 is OK: OK: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_zuul https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:39:09] (PS1) Dzahn: gerrit: update mtail::program title to match alert metrics [puppet] - https://gerrit.wikimedia.org/r/1303587
[23:43:24] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12032091 (Fuyo21) Resolved→Open Still happens on Russian wiki: "Timeline error. Could not store output files" https://ru.wikipedia.org/wiki/%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D...
[23:44:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1303590
[23:44:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1303590 (owner: TrainBranchBot)
[23:47:30] PROBLEM - Check unit status of security_group_ssh-from-restricted-bastion_to_project_zuul on cloudcontrol1006 is CRITICAL: CRITICAL: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_zuul https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:53:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1303590 (owner: TrainBranchBot)
[23:57:33] SRE, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Radar), User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12032107 (SomeRandomDeveloper) This seems to have caused {T429559}. I assume this was supposed to be addressed by {T42...