[00:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:23:42] (CR) Dzahn: [V:+1 C:+2] "no worries - this file exists on the server but is currently NOT included in apache config (can be verified with 'apache2ctl -t -D DUMP_IN" [puppet] - https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[00:33:00] (PS1) Dzahn: ci: load mod_ssl in httpd to be able to proxy https [puppet] - https://gerrit.wikimedia.org/r/1305531 (https://phabricator.wikimedia.org/T418521)
[01:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[01:49:19] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:54:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:55:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:55:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:59:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:00:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:00:25] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:00:49] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 00m 24s)
[02:04:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:05:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:29:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:29:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:29:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:44:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:44:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:44:55] !log pt1979@cumin1003 START - Cookbook sre.network.cf
[02:44:56] !log pt1979@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[02:45:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:49:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:07:10] ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/2026-Q3-Q4): Ensure cloudvirt capacity is more evenly spread out among racks - https://phabricator.wikimedia.org/T424658#12052421 (Jclark-ctr)
[03:56:32] (PS1) RLazarus: admin_ng: Fix comment [deployment-charts] - https://gerrit.wikimedia.org/r/1305542
[04:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:52] (CR) RLazarus: "Added in If15f9cc5, looks like just a mispaste." [deployment-charts] - https://gerrit.wikimedia.org/r/1305542 (owner: RLazarus)
[04:20:12] (PS1) RLazarus: admin_ng: Remove obsolete coredns 1.8.7-2 tag, unset everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/1305544
[04:20:12] (PS1) RLazarus: coredns: Parameterize `name` and `k8s_app` [deployment-charts] - https://gerrit.wikimedia.org/r/1305545 (https://phabricator.wikimedia.org/T427864)
[04:20:13] (PS1) RLazarus: coredns: Add an internal_only value [deployment-charts] - https://gerrit.wikimedia.org/r/1305546 (https://phabricator.wikimedia.org/T427864)
[04:20:14] (PS1) RLazarus: admin_ng: Install coredns-internalonly in wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1305547 (https://phabricator.wikimedia.org/T427864)
[04:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[04:34:17] (CR) RLazarus: coredns: Parameterize `name` and `k8s_app` (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1305545 (https://phabricator.wikimedia.org/T427864) (owner: RLazarus)
[05:12:02] (PS1) Marostegui: db1290: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1305549 (https://phabricator.wikimedia.org/T429929)
[05:13:03] (CR) Marostegui: [C:+2] db1290: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1305549 (https://phabricator.wikimedia.org/T429929) (owner: Marostegui)
[05:13:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2233].codfw.wmnet,db[1217,1228,1290].eqiad.wmnet with reason: Primary switchover m2 T429929
[05:13:45] T429929: Switchover m2 master (db1228 -> db1290) - https://phabricator.wikimedia.org/T429929
[05:15:27] (PS1) Marostegui: mariadb: Promote db1290 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1305550 (https://phabricator.wikimedia.org/T429929)
[05:16:25] (CR) Marostegui: [C:+2] mariadb: Promote db1290 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1305550 (https://phabricator.wikimedia.org/T429929) (owner: Marostegui)
[05:17:41] !log Failover m2 from db1228 to db1290 - T429929
[05:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:19:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:35:12] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[05:42:51] (PS1) Marostegui: db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1305562 (https://phabricator.wikimedia.org/T430106)
[05:46:18] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab
[05:50:28] (CR) Marostegui: [C:+2] db1228: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1305562 (https://phabricator.wikimedia.org/T430106) (owner: Marostegui)
[05:58:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1228.eqiad.wmnet with reason: Reimage to Trixie
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T0600)
[06:00:05] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T0600).
[06:01:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1228.eqiad.wmnet with OS trixie
[06:01:50] arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[06:04:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:07:58] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[06:08:58] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.500 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[06:11:32] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab
[06:12:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:14:23] (PS1) Marostegui: db1290: Add master role [puppet] - https://gerrit.wikimedia.org/r/1305563
[06:15:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage
[06:16:28] (CR) Marostegui: [C:+2] db1290: Add master role [puppet] - https://gerrit.wikimedia.org/r/1305563 (owner: Marostegui)
[06:17:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:19:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage
[06:24:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:26:34] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12052640 (fgiunchedi) Good find @ayounsi ! Also with respect to racking these hosts, my preference would be to have one host per rack once we can do 25G...
[06:28:06] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305443 (https://phabricator.wikimedia.org/T430059) (owner: Muehlenhoff)
[06:28:47] (CR) Muehlenhoff: [C:+2] Add Jesse to Bitu approvers [puppet] - https://gerrit.wikimedia.org/r/1305443 (https://phabricator.wikimedia.org/T430059) (owner: Muehlenhoff)
[06:29:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:32:28] (PS1) Ayounsi: depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300)
[06:33:58] ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12052650 (fgiunchedi) >>! In T401441#11986629, @VRiley-WMF wrote: > @fgiunchedi for these servers cloudcephosd1048, cloudcephosd1049, cloudcephosd1050, cloudcephosd1051 would we be able to sc...
[06:35:14] (CR) CI reject: [V:-1] depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[06:35:36] (PS1) Muehlenhoff: Add Ahmon Dancy to releng-related approvals [puppet] - https://gerrit.wikimedia.org/r/1305566
[06:39:40] (PS1) Muehlenhoff: Bitu: Add Ahmon Dancy as second approver for Spiderpig access [puppet] - https://gerrit.wikimedia.org/r/1305568
[06:40:35] (PS1) Marostegui: installserver: Do not format clouddb102[6-7] [puppet] - https://gerrit.wikimedia.org/r/1305569 (https://phabricator.wikimedia.org/T409557)
[06:40:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1228.eqiad.wmnet with OS trixie
[06:41:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:45:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:46:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:49:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:55:46] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305566 (owner: Muehlenhoff)
[06:57:45] (PS1) Arnaudb: gerrit: bump 4xx ratio to alert on [alerts] - https://gerrit.wikimedia.org/r/1305574
[06:57:53] (CR) Arnaudb: [C:+2] gerrit: bump 4xx ratio to alert on [alerts] - https://gerrit.wikimedia.org/r/1305574 (owner: Arnaudb)
[07:00:05] Amir1, urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:30] (Merged) jenkins-bot: gerrit: bump 4xx ratio to alert on [alerts] - https://gerrit.wikimedia.org/r/1305574 (owner: Arnaudb)
[07:02:52] (CR) Jelto: [C:+1] "lgtm, thank you" [alerts] - https://gerrit.wikimedia.org/r/1305574 (owner: Arnaudb)
[07:04:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1026.eqiad.wmnet with reason: Catching up
[07:05:01] (CR) Marostegui: [C:+2] installserver: Do not format clouddb102[6-7] [puppet] - https://gerrit.wikimedia.org/r/1305569 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[07:07:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Cloning
[07:10:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2234.codfw.wmnet,db1250.eqiad.wmnet with reason: Upgrading
[07:11:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie
[07:13:12] PROBLEM - MariaDB Replica IO: m3 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2234.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2234.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[07:17:36] ^ known
[07:18:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2160.codfw.wmnet with reason: Upgrading
[07:19:51] (PS1) Arnaudb: backups: edit gerrit fileset to exclude logs [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583)
[07:20:59] !log T423993: dropping ttmserver indices from the cirrussearch opensearch clusters
[07:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:03] T423993: Upgrade old indices in the CirrusSearch opensearch clusters - https://phabricator.wikimedia.org/T423993
[07:23:53] (PS2) Arnaudb: backups: edit gerrit fileset to exclude logs [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583)
[07:24:59] !log installing nginx security updates
[07:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:42] (CR) Arnaudb: [C:+1] "looks good to me, ccing in @jwodstrcil@wikimedia.org @dzahn@wikimedia.org and @aokoth@wikimedia.org for information" [puppet] - https://gerrit.wikimedia.org/r/1305460 (https://phabricator.wikimedia.org/T430024) (owner: Brouberol)
[07:29:01] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@86ab691] (releasing): T430110 Test on Jenkins secondary
[07:29:32] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@86ab691] (releasing): T430110 Test on Jenkins secondary (duration: 00m 50s)
[07:34:48] SRE, Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12052815 (MoritzMuehlenhoff)
[07:35:16] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2234.codfw.wmnet with OS trixie
[07:35:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie
[07:41:17] ops-codfw, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116 (Marostegui) NEW
[07:41:45] ops-codfw, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12052845 (Marostegui) p:Triage→Medium
[07:41:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2160.codfw.wmnet with reason: Upgrading
[07:43:05] (CR) Muehlenhoff: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1251406 (owner: Jelto)
[07:47:44] !log filippo@cumin1003 START - Cookbook sre.dns.netbox
[07:51:19] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on releases2003.codfw.wmnet with reason: T410849
[07:51:24] T410849: Update to Phorge/Arcanist upstream 2026-06-01 - https://phabricator.wikimedia.org/T410849
[07:52:08] !log filippo@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Allocate IPs for cloudvirt1077 - filippo@cumin1003"
[07:52:12] !log filippo@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Allocate IPs for cloudvirt1077 - filippo@cumin1003"
[07:52:12] !log filippo@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:52] !log marostegui@cumin1003 conftool action : set/weight=10; selector: name=clouddb1026.eqiad.wmnet
[08:03:12] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1026.eqiad.wmnet,service=s1
[08:03:45] !log Pool clouddb1026:s1 with a bit of weight T409557
[08:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:50] T409557: Productionize new clouddb* hosts (clouddb1022-1033) - https://phabricator.wikimedia.org/T409557
[08:07:08] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 retry Jenkins secondary
[08:07:37] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 retry Jenkins secondary (duration: 00m 53s)
[08:09:13] (PS2) Ayounsi: depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300)
[08:09:51] (CR) Brouberol: [V:+1 C:+2] global_config: register phabricator in the external-services [puppet] - https://gerrit.wikimedia.org/r/1305449 (https://phabricator.wikimedia.org/T430024) (owner: Brouberol)
[08:09:54] (CR) Brouberol: [C:+2] phabricator: enable egress from the dse kubepods networks [puppet] - https://gerrit.wikimedia.org/r/1305460 (https://phabricator.wikimedia.org/T430024) (owner: Brouberol)
[08:10:12] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 deploy to Jenkins primary
[08:10:55] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@ec879e3] (releasing): T430110 deploy to Jenkins primary (duration: 00m 52s)
[08:19:16] (CR) Ayounsi: [C:+2] depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[08:19:26] (CR) Ayounsi: [C:+2] "Self merge as it's a minor fix" [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[08:22:01] (Merged) jenkins-bot: depool-rack: properly call getattr() [cookbooks] - https://gerrit.wikimedia.org/r/1305564 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[08:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:27:21] (CR) Jcrespo: [C:+1] "Seems sensible, wanna me to merge it and check size change?" [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[08:27:28] (CR) Elukey: [C:+2] sre.hosts.provision: introduce the wmfroot user (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[08:27:51] jouncebot: nowandnext
[08:27:51] No deployments scheduled for the next 1 hour(s) and 32 minute(s)
[08:27:51] In 1 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1000)
[08:29:56] o/ I have some private code I'd like to deploy. Would now or some time today be acceptable? I have deployment rights and can self-service. Dreamy_Jazz has kindly agreed to help me if necessary.
[08:32:08] (by which I mean if no one has any objections, I'll start as we're between windows)
[08:33:08] (PS12) Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112)
[08:33:08] (PS1) Btullis: presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112)
[08:33:22] (PS2) Btullis: presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112)
[08:33:26] (PS1) Slyngshede: P:trafficserver::backend map thumb to swift backend [puppet] - https://gerrit.wikimedia.org/r/1305597 (https://phabricator.wikimedia.org/T427465)
[08:33:30] (PS13) Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112)
[08:33:44] (CR) CI reject: [V:-1] presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:35:26] (PS1) Filippo Giunchedi: site: put cloudvirt1077 in service [puppet] - https://gerrit.wikimedia.org/r/1305598 (https://phabricator.wikimedia.org/T429563)
[08:43:36] (PS1) Jforrester: On AW article deletion, clear all AWArticleStore from sections and metadata [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305599 (https://phabricator.wikimedia.org/T429873)
[08:43:38] (PS1) Jforrester: AWStorage: Use global stash keys [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305600 (https://phabricator.wikimedia.org/T430060)
[08:44:45] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet with reason: Reimaging db1221
[08:45:01] (PS1) Elukey: sre.hosts.reimage: user ADMIN or root for ipmi/redfish [cookbooks] - https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180)
[08:45:27] !log marostegui@cumin1003 conftool action : set/weight=30; selector: name=clouddb1026.eqiad.wmnet
[08:46:34] Tran, Dreamy_Jazz: Are you finished? I have a back-port to deploy.
[08:46:46] No it's in progress, testing right now
[08:47:18] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2006.codfw.wmnet with OS trixie
[08:47:43] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:47:43] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893
[08:47:49] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893
[08:47:53] Ack.
[08:48:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1221: Upgrading db1221.eqiad.wmnet
[08:48:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1221: Upgrading db1221.eqiad.wmnet
[08:48:54] (CR) Atsuko: [C:+1] presto: Enable resource groups and spill on the production cluster [puppet] - https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:49:03] (CR) Atsuko: [C:+1] presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:50:56] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1305598 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[08:50:58] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1221.eqiad.wmnet with OS trixie
[08:51:13] (CR) Arnaudb: [C:+2] "sounds good to me!" [puppet] - https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[08:52:01] (CR) Filippo Giunchedi: [C:+2] site: put cloudvirt1077 in service [puppet] - https://gerrit.wikimedia.org/r/1305598 (https://phabricator.wikimedia.org/T429563) (owner: Filippo Giunchedi)
[08:52:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:52:23] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893
[08:52:43] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2236: Upgrading db2236.codfw.wmnet
[08:52:45] SRE, hCaptcha, Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12053070 (MoritzMuehlenhoff) >>! In T430045#12051694, @Scott_French wrote: > FWIW, it does not look like https://gerrit.wikimedia.org/r/c/operations/de...
[08:53:05] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2236: Upgrading db2236.codfw.wmnet
[08:54:41] (CR) Btullis: [C:+2] presto: Remove the queryType from the catch-all resource selector [puppet] - https://gerrit.wikimedia.org/r/1305596 (https://phabricator.wikimedia.org/T424112) (owner: Btullis)
[08:54:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2236.codfw.wmnet with OS trixie
[08:54:57] James_F, done. All you.
[08:55:06] Thanks!
[08:55:07] SRE, Citoid: citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430053#12053086 (MoritzMuehlenhoff) This is the same root cause as T430045, merging them.
[08:55:29] SRE, hCaptcha, Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12053091 (MoritzMuehlenhoff)
[08:55:30] SRE, Citoid: citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430053#12053089 (MoritzMuehlenhoff) →Duplicate dup:T430045
[08:55:32] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305599 (https://phabricator.wikimedia.org/T429873) (owner: Jforrester)
[08:55:32] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - https://gerrit.wikimedia.org/r/1305600 (https://phabricator.wikimedia.org/T430060) (owner: Jforrester)
[08:55:48] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2234.codfw.wmnet with OS trixie
[08:55:49] SRE, hCaptcha, Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12053093 (MoritzMuehlenhoff) p:Triage→Medium
[08:56:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:57:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:57:33] (03Merged) 10jenkins-bot: On AW article deletion, clear all AWArticleStore from sections and metadata [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305599 (https://phabricator.wikimedia.org/T429873) (owner: 10Jforrester) [08:57:35] (03Merged) 10jenkins-bot: AWStorage: Use global stash keys [extensions/WikiLambda] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305600 (https://phabricator.wikimedia.org/T430060) (owner: 10Jforrester) [08:57:48] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:58:12] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1305599|On AW article deletion, clear all AWArticleStore from sections and metadata (T429873)]], [[gerrit:1305600|AWStorage: Use global stash keys (T430060)]] [08:58:18] T429873: Implement better deletion strategy for Abstract Content - https://phabricator.wikimedia.org/T429873 [08:58:18] T430060: AWArticleStore MainStash backend cross-wiki behavior not working as expected - https://phabricator.wikimedia.org/T430060 [08:58:44] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:00:21] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1305599|On AW article deletion, clear all AWArticleStore from sections and metadata (T429873)]], [[gerrit:1305600|AWStorage: Use global stash keys (T430060)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:00:46] !log jforrester@deploy1003 jforrester: Continuing with deployment [09:03:30] (03CR) 10Elukey: "Tested it with kafka-logging2006, a new host that doesn't have root deployed. 
All good :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [09:03:31] (03CR) 10Jelto: [C:03+1] backups: edit gerrit fileset to exclude logs [puppet] - 10https://gerrit.wikimedia.org/r/1305578 (https://phabricator.wikimedia.org/T411583) (owner: 10Arnaudb) [09:03:52] (03PS11) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [09:04:25] (03CR) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:05:27] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2006.codfw.wmnet with reason: host reimage [09:05:41] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305599|On AW article deletion, clear all AWArticleStore from sections and metadata (T429873)]], [[gerrit:1305600|AWStorage: Use global stash keys (T430060)]] (duration: 07m 29s) [09:05:47] T429873: Implement better deletion strategy for Abstract Content - https://phabricator.wikimedia.org/T429873 [09:05:48] T430060: AWArticleStore MainStash backend cross-wiki behavior not working as expected - https://phabricator.wikimedia.org/T430060 [09:06:23] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [09:06:58] !log marostegui@cumin1003 conftool action : set/weight=100; selector: name=clouddb1026.eqiad.wmnet [09:07:14] All done at my end now. 
[09:08:01] (03PS1) 10Brouberol: service: register the phabricator service [puppet] - 10https://gerrit.wikimedia.org/r/1305606 (https://phabricator.wikimedia.org/T430024) [09:08:04] (03PS1) 10Brouberol: service_proxy: register phabricator services [puppet] - 10https://gerrit.wikimedia.org/r/1305607 (https://phabricator.wikimedia.org/T430024) [09:08:18] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:08:55] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2006.codfw.wmnet with reason: host reimage [09:10:33] (03CR) 10Muehlenhoff: profile::base::reboot_unattended: add class to mark hosts for unattended reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:11:11] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2236.codfw.wmnet with reason: host reimage [09:11:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [09:12:55] (03CR) 10CWilliams: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [09:12:57] (03CR) 10AikoChou: [C:03+1] ml-services: Deploy the latest version of article-country model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [09:12:57] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2007.codfw.wmnet with OS trixie [09:13:38] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2008.codfw.wmnet with OS trixie [09:15:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2236.codfw.wmnet with reason: host reimage [09:18:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:21:52] (03PS12) 10Jelto: profile::base::reboot_unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [09:24:10] (03CR) 10CWilliams: mysql: update replication source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [09:26:05] (03CR) 10CWilliams: mysql: update replication source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [09:26:43] (03PS1) 10Kosta Harlan: hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305609 (https://phabricator.wikimedia.org/T429755) [09:26:45] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:26:54] jouncebot: nowandnext [09:26:54] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [09:26:54] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1000) [09:26:55] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [09:27:05] deploying a wmf.8 patch [09:27:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305609 (https://phabricator.wikimedia.org/T429755) (owner: 10Kosta Harlan) [09:29:50] elukey@cumin1003 reimage (PID 2747312) is awaiting input [09:31:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1221.eqiad.wmnet with OS trixie [09:31:01] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2007.codfw.wmnet with reason: host reimage [09:31:45] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2008.codfw.wmnet with reason: host reimage [09:32:28] (03CR) 10CWilliams: mysql: update replication source (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [09:33:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2236.codfw.wmnet with OS trixie [09:34:45] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2007.codfw.wmnet with reason: host reimage [09:35:19] (03Merged) 10jenkins-bot: hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF [extensions/ConfirmEdit] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305609 (https://phabricator.wikimedia.org/T429755) (owner: 10Kosta Harlan) [09:35:50] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1305609|hCaptcha: Skip blocked-IP score collection for crawlers 
in VE and MF (T429755)]] [09:35:54] T429755: hCaptcha: Exclude self-identified crawlers from IP blocked edit notice risk score collection - https://phabricator.wikimedia.org/T429755 [09:37:54] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1305609|hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF (T429755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:38:33] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2008.codfw.wmnet with reason: host reimage [09:39:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:39:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2006.codfw.wmnet with OS trixie [09:40:19] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:43:25] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1221: Migration of db1221.eqiad.wmnet completed [09:44:36] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305609|hCaptcha: Skip blocked-IP score collection for crawlers in VE and MF (T429755)]] (duration: 08m 46s) [09:44:41] T429755: hCaptcha: Exclude self-identified crawlers from IP blocked edit notice risk score collection - https://phabricator.wikimedia.org/T429755 [09:45:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1305617 (https://phabricator.wikimedia.org/T430127) [09:45:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2236: Migration of db2236.codfw.wmnet completed [09:52:20] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 6648 [09:52:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with 
action 'email' for AS: 6648 [09:53:15] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:56:19] elukey@cumin1003 reimage (PID 2751279) is awaiting input [09:58:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:58:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2007.codfw.wmnet with OS trixie [09:58:09] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:58:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:58:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2008.codfw.wmnet with OS trixie [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1000) [10:00:53] (03CR) 10Blake: [C:03+2] kubernetes: Add a k8s deployment for pretrain. 
[puppet] - 10https://gerrit.wikimedia.org/r/1305358 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [10:02:21] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1305397 (https://phabricator.wikimedia.org/T420438) (owner: 10Klausman) [10:03:54] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: ml-staging-master@codfw [10:04:08] (03CR) 10Klausman: [C:03+2] hiera: Switch ml-staging k8s to Maglev LVS config [puppet] - 10https://gerrit.wikimedia.org/r/1305397 (https://phabricator.wikimedia.org/T420438) (owner: 10Klausman) [10:07:25] !log klausman@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: ml-staging-master@codfw [10:09:18] (03PS1) 10Elukey: pontoon: add config for the kafka-upgrade stack used for testing [puppet] - 10https://gerrit.wikimedia.org/r/1305620 [10:09:26] i intend to use the infra window to do a non-build stop-before-sync deploy, in order to populate the release file for a new deployment (mw-pretrain) (T427668), and will proceed in 5m if there are no objections [10:09:26] T427668: Turn up the Pretrain MVP environment - https://phabricator.wikimedia.org/T427668 [10:09:58] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: ml-staging-master@codfw [10:10:32] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12053340 (10elukey) All hosts reimaged, we should be good! 
[10:13:32] !log klausman@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:14:06] (03CR) 10Nikerabbit: [C:03+1] Drop fund, phortune, support [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) (owner: 10Pppery) [10:14:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:14:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: ml-staging-master@codfw [10:15:19] !log blake@deploy1003 Started scap sync-world: Non-deployment scap run to populate new release values [10:15:45] !log blake@deploy1003 Stopping before sync operations [10:15:56] (03CR) 10Federico Ceratto: mysql: update replication source (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [10:16:43] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1026.eqiad.wmnet,service=s1 [10:17:31] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053382 (10CWilliams-WMF) [10:18:13] (03PS2) 10Elukey: pontoon: add config for the kafka-upgrade stack used for testing [puppet] - 10https://gerrit.wikimedia.org/r/1305620 [10:21:42] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053392 (10CWilliams-WMF) @Volans I have linked the ticket that I created for a very similar scenario, now mar... 
[10:21:54] (03CR) 10Klausman: [C:03+2] role::ml_k8s::staging::worker: enable IPIP encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1294225 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:22:09] (03PS1) 10Bartosz Wójtowicz: rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) [10:22:51] (03PS2) 10Elukey: Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) [10:23:09] (03CR) 10Klausman: [C:03+2] Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:23:15] (03CR) 10Klausman: [V:03+2 C:03+2] Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:25:06] (03CR) 10Klausman: [C:03+2] Set Maglev's scheduling for inference-staging and ingress [puppet] - 10https://gerrit.wikimedia.org/r/1294226 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [10:28:00] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#12053445 (10elukey) @RKemper I added you to the `kafka-infrastructure` cloud project, you should see it in Horizon! At this point,... 
[10:28:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1221: Migration of db1221.eqiad.wmnet completed [10:28:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:31:04] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053465 (10MoritzMuehlenhoff) This should soon no longer be an issue once https://phabricator.wikimedia.org/T4... [10:31:27] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2236: Migration of db2236.codfw.wmnet completed [10:31:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:32:12] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from to build2002 - https://phabricator.wikimedia.org/T417389#12053474 (10MoritzMuehlenhoff) I'll first add a new build host on trixie and then fail over to that instead. [10:34:55] (03PS1) 10Klausman: role/ml_k8s/staging/worker: add IPIP role [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) [10:35:45] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from to build2003 - https://phabricator.wikimedia.org/T417389#12053489 (10MoritzMuehlenhoff) [10:36:02] (03CR) 10Ozge: [C:03+1] ml-services: Deploy the latest version of article-country model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [10:37:14] !log jmm@cumin2003 START - Cookbook sre.ganeti.makevm for new host build2003.codfw.wmnet [10:37:16] !log jmm@cumin2003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host build2003.codfw.wmnet [10:37:21] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8781/co" [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) (owner: 10Klausman) [10:38:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host build2003.codfw.wmnet [10:38:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host build2003.codfw.wmnet [10:41:06] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from to build2004 - https://phabricator.wikimedia.org/T417389#12053500 (10MoritzMuehlenhoff) [10:42:48] (03PS1) 10Muehlenhoff: Add build2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1305627 (https://phabricator.wikimedia.org/T417389) [10:43:01] (03PS2) 10Muehlenhoff: Add build2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1305627 (https://phabricator.wikimedia.org/T417389) [10:43:52] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304147 (owner: 10PipelineBot) [10:45:32] (03CR) 10Elukey: [C:03+1] "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1298541 (owner: 10Volans) [10:46:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [10:46:33] (03CR) 10Daniel Kinzler: "That looks about right at a glance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [10:49:49] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy the latest version of article-country model on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [10:50:29] (03CR) 10CI reject: [V:04-1] config: type config_file as PathLike[str] [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1298541 (owner: 10Volans) [10:52:01] (03Merged) 10jenkins-bot: ml-services: Deploy the latest version of article-country model on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305392 (https://phabricator.wikimedia.org/T429675) (owner: 10Gkyziridis) [10:55:46] (03CR) 10Muehlenhoff: [C:03+2] Add build2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1305627 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [10:57:27] (03CR) 10Ilias Sarantopoulos: "lgtm! I'll defer to claime for the technical review but I can comment that I agree on the limits and the grouping as it is in line with wh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:00:12] (03CR) 10Clément Goubert: [C:03+1] "Looks right, just a quick question, will qwen314b be moved under that path?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:00:39] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#12053564 (10CWilliams-WMF) @MoritzMuehlenhoff thanks for that! 
[11:02:19] (03PS1) 10JavierMonton: k8s namespace: webrequest-page-trending [puppet] - 10https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136) [11:05:00] !log jforrester@deploy1003: mwscript sql.php --wiki=wikifunctionswiki --cluster extension1 extensions/WikiLambda/sql/mysql/table-wikifunctions_usage.sql # T428667 [11:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:04] T428667: Create a new x1 tables for cross-wiki tracking of Wikifunctions usage, similar to GlobalUsage - https://phabricator.wikimedia.org/T428667 [11:05:07] !log jforrester@deploy1003: mwscript sql.php --wiki=wikifunctionswiki --cluster extension1 extensions/WikiLambda/sql/mysql/table-wikifunctions_usage_wikis.sql # T428667 [11:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:02] (03PS1) 10JavierMonton: namespaces: webrequest-page-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [11:09:13] !log jmm@cumin2003 START - Cookbook sre.ganeti.makevm for new host build2004.codfw.wmnet [11:09:16] !log jmm@cumin2003 START - Cookbook sre.dns.netbox [11:10:27] (03CR) 10Jelto: [V:03+1 C:03+2] profile::base::reboot_unattended: add class to mark hosts for unattended reboots (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [11:11:42] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1208) - https://phabricator.wikimedia.org/T430138 (10LSobanski) 03NEW [11:12:12] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T430139 (10LSobanski) 03NEW [11:14:28] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[11:14:39] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [11:15:09] (03PS2) 10JavierMonton: namespaces: webrequest-page-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [11:15:23] jmm@cumin2003 makevm (PID 317727) is awaiting input [11:16:45] (03PS3) 10JavierMonton: namespaces: pageview-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [11:17:02] (03PS2) 10JavierMonton: k8s namespace: pageview-trending [puppet] - 10https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136) [11:24:02] (03PS1) 10Fabfur: cache::haproxy: add correlation id feature [puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) [11:25:34] !log jmm@cumin2003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM build2004.codfw.wmnet - jmm@cumin2003" [11:25:38] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM build2004.codfw.wmnet - jmm@cumin2003" [11:25:38] !log jmm@cumin2003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:25:39] !log jmm@cumin2003 START - Cookbook sre.dns.wipe-cache build2004.codfw.wmnet on all recursors [11:25:42] !log jmm@cumin2003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) build2004.codfw.wmnet on all recursors [11:26:13] !log jmm@cumin2003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM build2004.codfw.wmnet - jmm@cumin2003" [11:26:18] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
build2004.codfw.wmnet - jmm@cumin2003" [11:27:36] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) (owner: 10Fabfur) [11:28:28] !log jmm@cumin2003 START - Cookbook sre.hosts.reimage for host build2004.codfw.wmnet with OS trixie [11:40:08] (03CR) 10Bartosz Wójtowicz: "Yes, the plan is to move qwen3-14b and other LLMs we would like to expose under this path, it'd be done in follow-up patches." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:44:01] 06SRE, 06Infrastructure-Foundations: Migrate remaining container build/report steps from build2001 to build2004 - https://phabricator.wikimedia.org/T417389#12053767 (10LSobanski) [11:45:02] (03CR) 10Clément Goubert: [C:03+1] "Then be mindful that the pipe caching will only match `openai/v1` https://gerrit.wikimedia.org/r/c/operations/puppet/+/1293746" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [11:47:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [11:48:38] !log jmm@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [11:50:33] !log installing harfbuzz security updates [11:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:35] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [11:57:09] (03PS1) 10Dpogorzelski: ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) [11:57:21] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
lsw1-a7-codfw,lsw1-a7-codfw IPv6,lsw1-a7-codfw.mgmt with reason: Switch maintenance [11:57:30] (03CR) 10CI reject: [V:04-1] ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) (owner: 10Dpogorzelski) [11:58:33] (03PS2) 10Dpogorzelski: ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) [11:58:47] (03PS1) 10Mareike Heuer: Remove wgCiteRemoveSyntheticRefsUnsafe feature flag from production and beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305639 (https://phabricator.wikimedia.org/T428232) [11:59:39] (03PS1) 10Muehlenhoff: Add library hint for harfbuff [puppet] - 10https://gerrit.wikimedia.org/r/1305644 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1200) [12:02:21] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for harfbuff [puppet] - 10https://gerrit.wikimedia.org/r/1305644 (owner: 10Muehlenhoff) [12:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:21] (03CR) 10Jforrester: "Note that these tables are now live, so this would be good to land soon to avoid alerts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1305102 (https://phabricator.wikimedia.org/T428667) (owner: 10Jforrester) [12:04:35] !log ayounsi@cumin1003 START - Cookbook sre.network.depool-rack with action 'depool' for codfw rack A7 [12:05:02] (03PS2) 10Bartosz Wójtowicz: rest-gateway: Add LiftWingLLM rate limit policy for LLM endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) [12:06:25] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Rack A7 depool [12:07:04] (03CR) 10Bartosz Wójtowicz: "Noted, our endpoints are all under `openai/v1` so to be consistent I've narrowed the route's regex match to only `openai/v1` to match the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: 10Bartosz Wójtowicz) [12:08:01] !log aokoth@cumin1003 START - Cookbook sre.hosts.decommission for hosts phab2002.codfw.wmnet [12:08:24] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2224: rack depool [12:08:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2224: rack depool [12:09:02] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2225: rack depool [12:09:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2225: rack depool [12:09:36] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool es2045: rack depool [12:09:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2045: rack depool [12:12:39] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:12:39] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [12:12:46] (03PS1) 10Atsuko: data-platform/k8s: monitor for unreleased k8s changes [alerts] - 10https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) [12:12:51] !log ayounsi@cumin1003 START - 
Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet [12:12:53] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet [12:12:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1243: Upgrading db1243.eqiad.wmnet [12:13:14] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2258-2259].codfw.wmnet [12:13:17] !log aokoth@cumin1003 START - Cookbook sre.dns.netbox [12:13:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1243: Upgrading db1243.eqiad.wmnet [12:13:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet [12:13:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet [12:14:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2258-2259].codfw.wmnet [12:14:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.depool-rack (exit_code=0) with action 'depool' for codfw rack A7 [12:14:49] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash2023.codfw.wmnet with reason: A7 maintenace [12:15:47] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd2003.codfw.wmnet with reason: A7 maintenace [12:15:54] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12053887 (10Jhancock.wm) [12:15:59] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS trixie [12:16:10] !log cwilliams@cumin1003 
START - Cookbook sre.mysql.major-upgrade [12:16:10] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [12:16:12] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dse-k8s-etcd2001.codfw.wmnet with reason: A7 maintenace [12:16:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2237: Upgrading db2237.codfw.wmnet [12:16:36] !log jmm@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagemaster2005.codfw.wmnet with reason: A7 maintenace [12:16:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2237: Upgrading db2237.codfw.wmnet [12:17:23] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1305648 (owner: 10L10n-bot) [12:18:16] !log aokoth@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1003" [12:18:25] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2237.codfw.wmnet with OS trixie [12:19:15] !log aokoth@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1003" [12:19:15] !log aokoth@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:16] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab2002.codfw.wmnet [12:20:33] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:24:11] (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." 
[alerts] - 10https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) (owner: 10Atsuko) [12:26:12] (03PS1) 10AOkoth: site: remove phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727) [12:27:16] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:28:56] (03CR) 10Hashar: [C:03+1] ci: load mod_ssl in httpd to be able to proxy https [puppet] - 10https://gerrit.wikimedia.org/r/1305531 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [12:29:45] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12053940 (10Jhancock.wm) 05Open→03Resolved [12:31:39] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1019 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:31:42] (03CR) 10Atsuko: [C:03+2] data-platform/k8s: monitor for unreleased k8s changes [alerts] - 10https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) (owner: 10Atsuko) [12:32:33] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:32:46] (03PS5) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [12:33:07] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage [12:33:17] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: user ADMIN or 
root for ipmi/redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1305602 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [12:33:48] (03Merged) 10jenkins-bot: data-platform/k8s: monitor for unreleased k8s changes [alerts] - 10https://gerrit.wikimedia.org/r/1305646 (https://phabricator.wikimedia.org/T423078) (owner: 10Atsuko) [12:35:01] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2237.codfw.wmnet with reason: host reimage [12:37:35] federico3, _joe_, I'm going to reboot rack A7 switch for maintenance, all servers are depooled, the only unknown is cephosd2001, I can't get a hold on anyone to know if it needs a depool or not, but afaik those services are fault tolerant. Everything is downtimed. [12:38:16] <_joe_> uhm chephosd I suppose is DPE SRE? [12:38:28] <_joe_> btullis / brouberol, any idea? [12:38:29] XioNoX: these are maintained by Data Platform, I usually ping Ben for things [12:38:45] _joe_: yeah, I pinged everyone multiple times [12:38:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage [12:39:22] <_joe_> gehel: ^^ please advise [12:39:31] XioNoX: thanks, ack [12:39:53] (03PS1) 10Elukey: Pin pytest version and fix mypy errors in config.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 [12:40:42] <_joe_> XioNoX: give me a couple minutes, I'm pinging people on slack [12:40:50] _joe_: thanks for the help! 
[12:41:46] we should be good, but let me check [12:41:58] (03CR) 10Klausman: [C:03+1] ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) (owner: 10Dpogorzelski) [12:42:51] <3 [12:43:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2237.codfw.wmnet with reason: host reimage [12:43:18] (03CR) 10Elukey: "This is the starting point to get CI working again, then we'll be able to rebase Riccardo's and Jesse's patches on top. Lemme know!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 (owner: 10Elukey) [12:43:26] XioNoX: we're good, you can go ahead! [12:43:48] <_joe_> :) thanks gehel [12:44:06] awesome, thanks! [12:44:31] !log lsw1-a7-codfw> request system reboot - T429817 [12:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:36] T429817: codfw: rack A7 maintenance - https://phabricator.wikimedia.org/T429817 [12:44:40] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: add GPU partition allocation metrics [puppet] - 10https://gerrit.wikimedia.org/r/1305643 (https://phabricator.wikimedia.org/T429597) (owner: 10Dpogorzelski) [12:46:01] (03CR) 10Volans: [C:04-1] "LGTM but we need to support bullseye too" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 (owner: 10Elukey) [12:47:10] (03PS2) 10Elukey: Pin pytest version and fix mypy errors in config.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 [12:47:18] (03CR) 10Elukey: Pin pytest version and fix mypy errors in config.py (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 (owner: 10Elukey) [12:47:48] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:47:48] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:48:54] 
(03CR) 10Volans: [C:03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 (owner: 10Elukey) [12:49:14] _joe_ XioNoX sorry I was OOO for a bit [12:49:22] reading the backscroll [12:49:24] (03PS1) 10Filippo Giunchedi: hieradata: add nova id for cloudvirt1077 [puppet] - 10https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) [12:49:25] <_joe_> brouberol: I doubt you can be forgiven [12:49:41] <_joe_> to the gallows! [12:49:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/6 (Core: lsw1-a7-codfw:et-0/0/55 {#230403800019}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:49:54] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a7-codfw (10.192.252.9) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:49:55] XioNoX: feel free to reboot the rack. That ceph cluster is un-used atm anyway [12:50:15] I'll see myself to the gallows then [12:51:20] <_joe_> brouberol: sorry, I don't make the rules, or the business needs. 
[12:51:40] shot taken [12:51:55] <_joe_> :always_has_been: [12:52:17] (03CR) 10Filippo Giunchedi: "root@cloudvirt1077:~# cat /etc/nova/compute_id" [puppet] - 10https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [12:52:38] (03CR) 10Elukey: [C:03+2] Pin pytest version and fix mypy errors in config.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1305665 (owner: 10Elukey) [12:54:41] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:56:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1243.eqiad.wmnet with OS trixie [12:57:37] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [12:57:44] switch is back up [12:57:47] (03CR) 10Volans: [C:03+2] hieradata: add nova id for cloudvirt1077 [puppet] - 10https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [12:57:50] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:57:50] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:58:03] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [12:58:19] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: add nova id for cloudvirt1077 [puppet] - 10https://gerrit.wikimedia.org/r/1305668 (https://phabricator.wikimedia.org/T429563) 
(owner: 10Filippo Giunchedi) [12:58:31] going to proceed with the repool very soon [12:58:38] (03PS1) 10Anzx: isvwiki: set timezone, sitename and logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) [12:59:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) (owner: 10Anzx) [12:59:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2237.codfw.wmnet with OS trixie [12:59:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:59:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/6 (Core: lsw1-a7-codfw:et-0/0/55 {#230403800019}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:59:54] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a7-codfw (10.192.252.9) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:00:04] Lucas_WMDE, urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1300). [13:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:16] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1077.eqiad.wmnet with OS trixie [13:00:17] o/ [13:00:28] (03PS6) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [13:01:25] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:34] !log installing glib2.0 security updates [13:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:37] RESOLVED: [6x] CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:02:56] I can deploy [13:03:35] moritzm: you can repool A7 ganeti [13:03:40] XioNoX: on it [13:04:12] (03CR) 10Zabe: [C:03+2] isvwiki: set timezone, sitename and logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) (owner: 10Anzx) [13:04:47] (03CR) 10Zabe: [C:03+2] Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248909 (https://phabricator.wikimedia.org/T413362) (owner: 10Zabe) [13:04:59] (03PS1) 10Ottomata: html_content_change - bump image to v1.56.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305672 (https://phabricator.wikimedia.org/T427598) [13:05:12] (03Merged) 10jenkins-bot: isvwiki: set timezone, 
sitename and logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305476 (https://phabricator.wikimedia.org/T429935) (owner: 10Anzx) [13:05:27] !log jmm@cumin2003 START - Cookbook sre.ganeti.addnode for new host ganeti2028.codfw.wmnet to cluster codfw and group A [13:05:29] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1305620 (owner: 10Elukey) [13:05:42] (03Merged) 10jenkins-bot: Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248909 (https://phabricator.wikimedia.org/T413362) (owner: 10Zabe) [13:06:15] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1305476|isvwiki: set timezone, sitename and logos (T429935)]], [[gerrit:1248909|Use Hadoop for Mostcategories on commonswiki (T413362)]] [13:06:17] !log re-added ganeti2028 to codfw/A Ganeti cluster T429817 [13:06:22] T429935: Post-creation work for isvwiki - https://phabricator.wikimedia.org/T429935 [13:06:24] T413362: Move Mostcategories computation to Hadoop - https://phabricator.wikimedia.org/T413362 [13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:28] T429817: codfw: rack A7 maintenance - https://phabricator.wikimedia.org/T429817 [13:07:11] (03CR) 10Ottomata: [C:03+2] html_content_change - bump image to v1.56.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305672 (https://phabricator.wikimedia.org/T427598) (owner: 10Ottomata) [13:07:57] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2258-2259].codfw.wmnet [13:07:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2258-2259].codfw.wmnet [13:08:23] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet [13:08:24] !log zabe@deploy1003 zabe, anzx: Backport for [[gerrit:1305476|isvwiki: set 
timezone, sitename and logos (T429935)]], [[gerrit:1248909|Use Hadoop for Mostcategories on commonswiki (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-wdqs2003.codfw.wmnet,dse-k8s-wdqs-test2001.codfw.wmnet [13:08:36] looking [13:09:16] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2225: rack depool [13:09:17] zabe: looks good, ok to sync [13:09:18] (03Merged) 10jenkins-bot: html_content_change - bump image to v1.56.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305672 (https://phabricator.wikimedia.org/T427598) (owner: 10Ottomata) [13:09:21] Thanks! [13:09:25] !log zabe@deploy1003 zabe, anzx: Continuing with deployment [13:09:45] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2224: rack depool [13:10:25] jmm@cumin2003 addnode (PID 352666) is awaiting input [13:11:30] zabe: please run namespacedupes.php for isvwiki after sync [13:11:32] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1243: Migration of db1243.eqiad.wmnet completed [13:11:37] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-wdqs-test2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:12:05] !log filippo@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1077.eqiad.wmnet with reason: host reimage [13:13:37] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:13:45] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305476|isvwiki: set timezone, sitename and logos (T429935)]], [[gerrit:1248909|Use Hadoop for Mostcategories on commonswiki (T413362)]] (duration: 07m 30s) [13:13:54] T429935: Post-creation work for isvwiki - https://phabricator.wikimedia.org/T429935 
[13:13:54] T413362: Move Mostcategories computation to Hadoop - https://phabricator.wikimedia.org/T413362 [13:14:04] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:14:06] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:14:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:14:25] (03CR) 10CWilliams: mysql: update replication source (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [13:14:34] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:15:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2237: Migration of db2237.codfw.wmnet completed [13:16:22] !log zabe@deploy1003:~$ mwscript namespaceDupes.php isvwiki --fix # T429935 [13:16:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:52] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:16:52] zabe: thanks for deploying [13:16:57] yw [13:16:57] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:17:21] (03CR) 10Bking: [C:03+2] opensearch: split plugins_mandatory into own key [puppet] - 10https://gerrit.wikimedia.org/r/1305321 
(https://phabricator.wikimedia.org/T429844) (owner: 10Ryan Kemper) [13:17:35] zabe@deploy1003:~$ echo 'https://en.wikipedia.org/static/images/project-logos/isvwiki.png' | mwscript-k8s --attach purgeList.php -- --wiki enwiki # T429935 [13:18:03] (probably not needed since it is new and not a change) [13:19:00] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1077.eqiad.wmnet with reason: host reimage [13:19:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:20:32] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:21:01] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2028.codfw.wmnet to cluster codfw and group A [13:27:56] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-wdqs-test2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:28:34] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:28:36] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:28:37] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:28:41] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:28:45] (03PS4) 10JavierMonton: namespaces: pageview-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) [13:29:06] (03PS7) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 
10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [13:29:33] (03PS3) 10JavierMonton: k8s namespace: pageview-trending [puppet] - 10https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136) [13:29:40] (03CR) 10CI reject: [V:04-1] mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [13:33:58] (03PS8) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [13:37:36] !log installing imagemagick security updates [13:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:52] (03PS1) 10Muehlenhoff: Add Hiera config for build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1305677 (https://phabricator.wikimedia.org/T417389) [13:43:20] (03CR) 10Elukey: [C:03+1] Add Hiera config for build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1305677 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:47:28] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:47:56] (03CR) 10Muehlenhoff: [C:03+2] Add Hiera config for build2004 [puppet] - 10https://gerrit.wikimedia.org/r/1305677 (https://phabricator.wikimedia.org/T417389) (owner: 10Muehlenhoff) [13:48:28] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:50:55] (03CR) 10Andrew Bogott: [C:03+1] "I have cherry-picked this to the cloud-vps puppetserver; I don't like having local patches there so we need to decide whether to merge or " [puppet] - 10https://gerrit.wikimedia.org/r/1302978 
(https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [13:51:12] !log jmm@cumin2003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host build2004.codfw.wmnet with OS trixie [13:51:12] !log jmm@cumin2003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host build2004.codfw.wmnet [13:53:27] (03PS1) 10Muehlenhoff: Iniitally install build2004 with insetup [puppet] - 10https://gerrit.wikimedia.org/r/1305678 [13:53:48] (03PS1) 10AikoChou: ml-services: bump event-emitting isvc image tags in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305679 (https://phabricator.wikimedia.org/T421237) [13:54:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2225: rack depool [13:55:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2224: rack depool [13:57:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1243: Migration of db1243.eqiad.wmnet completed [13:57:04] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:00:41] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2237: Migration of db2237.codfw.wmnet completed [14:00:42] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:00:44] (03PS1) 10Dpogorzelski: ml-serve: GPU partitions by size and MI210 support [puppet] - 10https://gerrit.wikimedia.org/r/1305680 (https://phabricator.wikimedia.org/T429597) [14:01:54] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: GPU partitions by size and MI210 support [puppet] - 10https://gerrit.wikimedia.org/r/1305680 (https://phabricator.wikimedia.org/T429597) (owner: 10Dpogorzelski) [14:04:02] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1077.eqiad.wmnet with OS trixie [14:05:15] !log Ran `delete from cuci_user where ciu_ciwm_id = 4;` for T430156 [14:05:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:19] T430156: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'apiportalwiki'Function: Wikimedia\Rdbms\DatabaseMySQL::doSelectDomainQuery: USE `apiportalwiki` - https://phabricator.wikimedia.org/T430156 [14:08:35] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:08:35] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [14:08:42] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [14:08:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1247: Upgrading db1247.eqiad.wmnet [14:09:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1247: Upgrading db1247.eqiad.wmnet [14:11:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1247.eqiad.wmnet with OS trixie [14:11:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:11:50] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [14:12:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2245: Upgrading db2245.codfw.wmnet [14:12:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2245: Upgrading db2245.codfw.wmnet [14:14:03] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2245.codfw.wmnet with OS trixie [14:19:57] (03CR) 10Muehlenhoff: [C:03+2] Iniitally install build2004 with insetup [puppet] - 10https://gerrit.wikimedia.org/r/1305678 (owner: 10Muehlenhoff) [14:21:02] !log Restarting CI Jenkins on contint1002 [14:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:03] !log jmm@cumin2003 START - Cookbook sre.hosts.reimage for host build2004.codfw.wmnet with OS trixie [14:28:31] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [14:30:05] Deploy window Test Kitchen Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1430) [14:30:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2245.codfw.wmnet with reason: host reimage [14:34:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [14:34:38] 06SRE, 10SRE-Access-Requests: Requesting access for lerickson to deploy the RDF streaming updater on wikikube - https://phabricator.wikimedia.org/T429610#12054485 (10thcipriani) >>! In T429610#12039261, @MoritzMuehlenhoff wrote: > @thcipriani This needs your approval for the deployment group. Sorry for delay,... [14:36:22] !log Drop database apiportalwiki on sanitarium and wikireplicas T430102 [14:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:27] T430102: Delete apiportalwiki from wikireplicas - https://phabricator.wikimedia.org/T430102 [14:36:32] (03CR) 10Ahmon Dancy: "Thanks Andrew. It worked! So, I'll submit a new patchset which removes the deployment-dancy* hostname test condition." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [14:37:28] (03CR) 10Fabfur: [C:03+1] role/ml_k8s/staging/worker: add IPIP role [puppet] - 10https://gerrit.wikimedia.org/r/1305623 (https://phabricator.wikimedia.org/T42043) (owner: 10Klausman) [14:38:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2245.codfw.wmnet with reason: host reimage [14:38:18] jouncebot: nowandnext [14:38:18] For the next 0 hour(s) and 21 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1430) [14:38:18] In 0 hour(s) and 21 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1500) [14:39:51] 06SRE, 06Infrastructure-Foundations: Adding Jesse to approvers for Bitu - https://phabricator.wikimedia.org/T430059#12054505 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access has been setup and confirmed to be working fine. [14:40:07] (03CR) 10Hnowlan: [C:03+2] redis: disable nrpe checks, replace with prometheus checks [puppet] - 10https://gerrit.wikimedia.org/r/1305347 (https://phabricator.wikimedia.org/T384924) (owner: 10Tiziano Fogli) [14:40:25] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054512 (10Jhancock.wm) @Marostegui that's correct. i upgraded the bios (and the idrac) firmware and it's fixed the issue. it did boot rather than reimage. So you might have to restart whatever you... [14:42:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie [14:42:30] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054522 (10Marostegui) Thank you @Jhancock.wm - just restarted the reimage! [14:43:09] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. 
- https://phabricator.wikimedia.org/T430116#12054529 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:43:21] (03PS5) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [14:43:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304888 (https://phabricator.wikimedia.org/T429830) (owner: 10Arlolra) [14:47:48] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new: Setup url-downloader-next.w.o to simply tests - https://phabricator.wikimedia.org/T430166 (10MoritzMuehlenhoff) 03NEW [14:48:05] !log jmm@cumin2003 START - Cookbook sre.hosts.downtime for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [14:48:13] 06SRE, 10hCaptcha, 06Product Safety and Integrity: hcaptcha failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12054558 (10MoritzMuehlenhoff) >>! In T430045#12053070, @MoritzMuehlenhoff wrote: > One other actionable is to add a new CNAME url-downloader-next, which... [14:51:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1247.eqiad.wmnet with OS trixie [14:52:04] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2004.codfw.wmnet with reason: host reimage [14:52:09] (03CR) 10Ahmon Dancy: "@abogott@wikimedia.org This is the desired final version. Lemme know what you think." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [14:53:12] (03PS14) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [14:53:12] (03PS1) 10Btullis: presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) [14:54:01] (03CR) 10CI reject: [V:04-1] presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:54:05] (03CR) 10CI reject: [V:04-1] presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:56:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2245.codfw.wmnet with OS trixie [14:57:21] (03PS2) 10Btullis: presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) [14:57:21] (03PS15) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [14:57:48] (03PS3) 10Btullis: presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) [14:57:49] (03CR) 10CI reject: [V:04-1] presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:58:02] (03CR) 10CI reject: [V:04-1] presto: Enable resource groups and spill on the production cluster [puppet] - 
10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [14:58:10] (03PS16) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [14:59:32] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-23-135458 to 2026-06-25-145651 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305698 (https://phabricator.wikimedia.org/T416144) [14:59:34] (03CR) 10Brouberol: [C:03+1] "LG!" [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [14:59:39] !log ongoing maintenance on cr2-eqdfw [14:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] brennen and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1500). 
[15:00:35] !log pt1979@cumin2003 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqdfw cr2-eqdfw IPv6 with reason: junos upgrade [15:02:42] !log pt1979@cumin2003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: junos upgrade [15:06:14] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1247: Migration of db1247.eqiad.wmnet completed [15:08:43] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host build2004.codfw.wmnet with OS trixie [15:10:18] (03PS17) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [15:10:29] (03PS18) 10Btullis: presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) [15:10:35] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:10:42] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:11:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2245: Migration of db2245.codfw.wmnet completed [15:12:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:19:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054790 (10Jhancock.wm) 05Resolved→03Open @Marostegui i rechecked just now. i guess i was wrong. lemme know when you are done and i'll go reseat some cables. [15:20:48] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12054800 (10Marostegui) Yeah it's not booting :(. 
You can go ahead and do anything you need to do. Thanks! [15:22:00] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-06-23-135458 to 2026-06-25-145651 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305698 (https://phabricator.wikimedia.org/T416144) (owner: 10Jforrester) [15:22:36] Need to do a private code deploy if there aren't any objections. I don't think anything is happening in this window right now. [15:23:15] Tran: I'm deploying a service but it won't affect MW-land. [15:23:39] JennH: hey i am going to wait another 5 minutes and start the junos upgrade on the router and reboot, it looks like all the transport links are drained now [15:24:17] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-06-23-135458 to 2026-06-25-145651 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305698 (https://phabricator.wikimedia.org/T416144) (owner: 10Jforrester) [15:24:38] James_F: Should I still wait until you're done? [15:24:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:24:51] Tran: No, just go for it. [15:25:00] alright, thanks. Starting then.
[15:25:13] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:26:02] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:26:14] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:26:44] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:26:50] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:27:18] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:28:01] (Done.) [15:29:24] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:31:58] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:32:30] FIRING: Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:33:39] (03PS15) 10Btullis: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [15:34:28] (03PS1) 10Volans: systemd::path: fix empty Unit= in path unit [puppet] - 10https://gerrit.wikimedia.org/r/1305701 [15:34:28] (03CR) 10Volans: "It seems that this class is currently unused but I was planning to use it in the next patch in the series, and discovered the typo." 
[puppet] - 10https://gerrit.wikimedia.org/r/1305701 (owner: 10Volans) [15:36:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [15:36:18] (03PS1) 10Volans: utils: skip .mypy_cache in run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/1305703 [15:36:18] (03CR) 10Volans: "I've encountered this issue when running CI "locally" inside a VM on my laptop via [1]." [puppet] - 10https://gerrit.wikimedia.org/r/1305703 (owner: 10Volans) [15:37:30] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:39:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:40:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:41:05] Done as well [15:45:28] (03PS1) 10C. 
Scott Ananian: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) [15:47:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [15:48:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:49:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:16] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:26] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:26] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:26] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:35] (03CR) 10Elukey: "@brouberol@wikimedia.org I am chatting with Tiziano, I think that this is a good occasion to do some cleanup.. 
the kafka mirror profile is" [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [15:49:36] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:22] (03CR) 10Elukey: [C:03+2] pontoon: add config for the kafka-upgrade stack used for testing [puppet] - 10https://gerrit.wikimedia.org/r/1305620 (owner: 10Elukey) [15:50:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-esams and cr2-eqdfw (208.80.153.217) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:50:49] (03CR) 10Btullis: [C:03+2] presto: Match resource-group selectors on the Kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1305696 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [15:51:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1247: Migration of db1247.eqiad.wmnet completed [15:51:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:52:30] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:52:50] (03CR) 10C. Scott Ananian: [C:03+2] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [15:53:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[3456] - https://phabricator.wikimedia.org/T419892#12055099 (10fgiunchedi) a:05Andrew→03fgiunchedi Taking this on, I'll re-assign as needed once we have a path forward [15:53:30] (03CR) 10C. 
Scott Ananian: [C:04-2] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [15:53:45] (03CR) 10Filippo Giunchedi: [C:03+1] systemd::path: fix empty Unit= in path unit [puppet] - 10https://gerrit.wikimedia.org/r/1305701 (owner: 10Volans) [15:54:10] FIRING: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:54:42] (03CR) 10C. Scott Ananian: "Accidentally clicked C+2 on the wrong patch, whoops." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [15:54:54] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:56:42] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. 
Scott Ananian) [15:56:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2245: Migration of db2245.codfw.wmnet completed [15:56:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:57:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:26] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:28] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:28] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:30] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:57:36] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:58] (03CR) 10FNegri: [C:03+1] utils: skip .mypy_cache in run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/1305703 (owner: 10Volans) [15:59:10] RESOLVED: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:59:54] RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - 
group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:01:05] (03PS2) 10Fabfur: cache::haproxy: add correlation id feature [puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) [16:02:20] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2234.codfw.wmnet with OS trixie [16:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:26] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12055186 (10Jhancock.wm) okay give it another shot. If it does it again I'm gonna open a ticket with Dell. Or we can try just leaving the riser out. I'm gonna leave the ticket open until you get a... 
[16:13:22] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:14:22] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:14:24] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:14:38] (03PS1) 10Cwhite: prometheus: add authentication parameters to es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1305718 (https://phabricator.wikimedia.org/T350516) [16:14:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:22] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:15:37] (03PS1) 10CWilliams: Allow a single replica for sre.mysql.major-upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1305682 (https://phabricator.wikimedia.org/T429758) [16:15:39] jhathaway, rzl , federico3, _joe_ if there's space in this window, I'd like to backport a new version of parsoid to wmf.8 before group2 rolls. [16:16:15] no objection from me, we had no puppet patches [16:16:39] <_joe_> +1 [16:16:49] cscott: can you elaborate on any risk? 
[16:17:03] !log pt1979@cumin2003 START - Cookbook sre.hosts.remove-downtime for cr2-eqdfw,cr2-eqdfw IPv6 [16:17:05] !log pt1979@cumin2003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr2-eqdfw,cr2-eqdfw IPv6 [16:17:26] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:17:50] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:17:50] federico3: latest version of parsoid has passed our internal round-trip testing and fixes some corner case bugs with nested templates and links on templatedata pages which we'd like to have live before parsoid read views is enabled on english wikipedia [16:18:11] (03CR) 10Hnowlan: [C:03+2] redis: remove nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1305075 (https://phabricator.wikimedia.org/T384924) (owner: 10Hnowlan) [16:18:24] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:18:24] federico3: i'm trying to get this in *before* the group2 roll so that we have at least a little bit of time to bake in group1 to smoke test before turning it on everywhere [16:18:31] FIRING: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:40] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:19:01] (that is, the out-of-window backport is an attempt to minimize the risk, since 
the alternative would be a backport in the "usual" window that would immediately go live to all of group 2) [16:19:25] federico3: ^ [16:19:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:20:08] (03PS2) 10Hnowlan: redis: clean up redis nrpe check components [puppet] - 10https://gerrit.wikimedia.org/r/1305077 (https://phabricator.wikimedia.org/T384924) [16:20:22] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:20:22] !log cwilliams@cumin1003 dbmaint on s4@eqiad T429893 [16:20:30] T429893: Migrate s4 section to Debian Trixie - https://phabricator.wikimedia.org/T429893 [16:20:42] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1248: Upgrading db1248.eqiad.wmnet [16:20:46] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:20:46] !log cwilliams@cumin1003 dbmaint on s4@codfw T429893 [16:21:07] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2246: Upgrading db2246.codfw.wmnet [16:21:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [16:21:40] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2246: Upgrading db2246.codfw.wmnet [16:21:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1248: Upgrading db1248.eqiad.wmnet [16:22:01] FIRING: [14x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:22:02] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a12 [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) [16:23:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1248.eqiad.wmnet with OS trixie [16:23:31] RESOLVED: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:54] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a12 [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) [16:24:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) (owner: 10C. Scott Ananian) [16:24:27] federico3: does that answer your question/concern? [16:24:49] cscott: yess, +1 from me [16:25:06] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2246.codfw.wmnet with OS trixie [16:26:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) (owner: 10C. 
Scott Ananian) [16:27:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) (owner: 10C. Scott Ananian) [16:27:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) (owner: 10C. Scott Ananian) [16:27:37] (03PS1) 10Cwhite: prometheus: remove unused buster branch [puppet] - 10https://gerrit.wikimedia.org/r/1305721 [16:29:43] (03CR) 10Btullis: [C:03+2] presto: Enable resource groups and spill on the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/1305109 (https://phabricator.wikimedia.org/T424112) (owner: 10Btullis) [16:35:21] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a12 [vendor] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305719 (https://phabricator.wikimedia.org/T353697) (owner: 10C. Scott Ananian) [16:36:06] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a12 [core] (wmf/1.47.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1305720 (https://phabricator.wikimedia.org/T429822) (owner: 10C. 
Scott Ananian) [16:36:37] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305719|Bump wikimedia/parsoid to 0.24.0-a12 (T353697 T384490 T387374 T387520 T387521 T391624 T393295 T420336 T429624 T429688 T429822)]], [[gerrit:1305720|Bump wikimedia/parsoid to 0.24.0-a12 (T429822)]] [16:37:11] T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697 [16:37:11] T384490: Include directives on a line with headings prevent the legacy parser from generating section edit links - https://phabricator.wikimedia.org/T384490 [16:37:12] T387374: Compound templates prevent section edit links where legacy adds them - https://phabricator.wikimedia.org/T387374 [16:37:12] T387520: Support section edit links to nested templates - https://phabricator.wikimedia.org/T387520 [16:37:13] T387521: Section titles failing to resolve redirected templates - https://phabricator.wikimedia.org/T387521 [16:37:13] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [16:37:13] T393295: Measurement plan + Analysis for the "Get Started" experiment (WE1.2.17, FY24/25) - https://phabricator.wikimedia.org/T393295 [16:37:14] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [16:37:14] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [16:37:15] T429688: mw-empty-elt wrapping does not take DOMFragments into account - https://phabricator.wikimedia.org/T429688 [16:37:15] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [16:38:23] (03CR) 10Andrew Bogott: "Presumably the puppet certs exist in the first place to prevent some kind of mitm attack where a new vindictive puppetserver is injected i" [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [16:38:41] !log cscott@deploy1003 cscott: Backport for [[gerrit:1305719|Bump 
wikimedia/parsoid to 0.24.0-a12 (T353697 T384490 T387374 T387520 T387521 T391624 T393295 T420336 T429624 T429688 T429822)]], [[gerrit:1305720|Bump wikimedia/parsoid to 0.24.0-a12 (T429822)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:39:56] (03CR) 10Hnowlan: [C:03+1] logstash: send thumbor logs to test partition [puppet] - 10https://gerrit.wikimedia.org/r/1305260 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [16:40:30] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [16:41:22] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker1160.eqiad.wmnet with OS trixie [16:41:38] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2246.codfw.wmnet with reason: host reimage [16:41:55] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1160 [16:42:01] FIRING: [8x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1015:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:42:13] !log cscott@deploy1003 cscott: Continuing with deployment [16:42:58] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [16:45:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [16:46:32] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1305719|Bump wikimedia/parsoid to 0.24.0-a12 (T353697 T384490 T387374 T387520 T387521 T391624 T393295 T420336 T429624 T429688 T429822)]], [[gerrit:1305720|Bump wikimedia/parsoid to 0.24.0-a12 (T429822)]] (duration: 09m 54s) [16:46:55] T353697: Parsoid/legacy parser {{Pre}} template rendering difference - https://phabricator.wikimedia.org/T353697 [16:46:56] T384490: Include directives on a line with headings prevent the legacy parser from generating section edit links - https://phabricator.wikimedia.org/T384490 [16:46:56] T387374: Compound templates prevent section edit links where legacy adds them - https://phabricator.wikimedia.org/T387374 [16:46:57] T387520: Support section edit links to nested templates - https://phabricator.wikimedia.org/T387520 [16:46:58] T387521: Section titles failing to resolve redirected templates - https://phabricator.wikimedia.org/T387521 [16:46:58] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [16:46:58] T393295: Measurement plan + Analysis for the "Get Started" experiment (WE1.2.17, FY24/25) - https://phabricator.wikimedia.org/T393295 [16:46:59] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [16:46:59] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [16:47:00] T429688: mw-empty-elt wrapping does not take DOMFragments into account - 
https://phabricator.wikimedia.org/T429688 [16:47:00] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [16:47:17] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1160 - jasmine@cumin2002" [16:47:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1160 - jasmine@cumin2002" [16:47:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:47:23] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker1160.eqiad.wmnet 116.48.64.10.in-addr.arpa 6.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:47:26] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1160.eqiad.wmnet 116.48.64.10.in-addr.arpa 6.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:47:27] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1160 [16:48:26] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read views for 5% of English Wikipedia desktop traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) [16:48:51] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1160 [16:48:51] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1160 [16:49:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. 
Scott Ananian) [16:49:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:50:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2246.codfw.wmnet with reason: host reimage [16:50:20] (03Merged) 10jenkins-bot: Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305711 (https://phabricator.wikimedia.org/T429624) (owner: 10C. Scott Ananian) [16:50:32] looking [16:50:40] <_joe_> ulsfo [16:50:46] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1305711|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] [16:50:56] since 10 mins [16:52:51] !log cscott@deploy1003 cscott: Backport for [[gerrit:1305711|Turn on Parsoid's 'ReturnExperimentalPFragmentTypes' (T429624 T429822 T391624)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:53:00] T429624: Link to edit TemplateData is broken with Parsoid Read Views - https://phabricator.wikimedia.org/T429624 [16:53:00] T429822: CTT tasks week of 2026-06-19 - https://phabricator.wikimedia.org/T429822 [16:53:00] T391624: Parsoid section edit link issues - https://phabricator.wikimedia.org/T391624 [16:53:05] !ack [16:53:06] 8097 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [16:53:08] !incidents [16:53:08] 8097 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet ulsfo) [16:54:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:59:14] !log cscott@deploy1003 cscott: Continuing with deployment [17:00:05] bd808: Your horoscope predicts another Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260625T1700) [17:02:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305724 (https://phabricator.wikimedia.org/T430194) (owner: 10C. Scott Ananian) [17:02:25] It looks like I can ship some developer-portal updates in today's window. 
[17:02:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1248.eqiad.wmnet with OS trixie [17:04:05] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage [17:07:07] i'm just waiting for the tail end of a config deploy. seems like it's been stuck on 54 of 60 "k8s canaries" for a while [17:07:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2246.codfw.wmnet with OS trixie [17:09:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage [17:12:32] federico3, _joe_, bd808 spiderpig failed: [17:12:41] https://www.irccloud.com/pastebin/v18aFWlz/ [17:13:05] how long was the failure and rollback? [17:13:18] i asked it to retry, but now it's stuck at 0 of 60. [17:13:26] the log says it failed after 10m and that seems about right. [17:14:00] <_joe_> cscott: we're currently looking into a page, let's see if someone else can help you [17:14:01] the running timer on spiderpig says it's been trying to deploy the config change for 25m now, and that includes time it's spent on the retry and time spent in the middle while arlo and i were testing on the testservers, etc [17:14:27] <_joe_> we're dealing with something a bit more urgent atm [17:14:58] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1248: Migration of db1248.eqiad.wmnet completed [17:15:10] (03CR) 10Thcipriani: [C:03+1] "Nice improvement! Readability is better and tests just fine. Nice work." [puppet] - 10https://gerrit.wikimedia.org/r/1302910 (owner: 10Ahmon Dancy) [17:15:42] cscott: taking a look! 
not impossible it's related to the page, but let's see [17:15:49] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-api-int/canary on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:16:19] ^ that alert is just identifying the same thing, the rollout is stuck/slow [17:16:36] (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-06-25-122144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305729