[00:09:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152179 [00:09:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152179 (owner: TrainBranchBot) [00:30:41] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152179 (owner: TrainBranchBot) [00:49:15] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/c509664739cd936120697f13db222560a84899070a2a5e08d29bbc42a65d77e3/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:00:13] (PS1) Andrew Bogott: nova vendordata: try another overly-clever way to untangle resolv.conf [puppet] - https://gerrit.wikimedia.org/r/1152182 [01:09:15] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:29:54] (CR) Andrew Bogott: [C:+2] nova vendordata: try another overly-clever way to untangle resolv.conf [puppet] - https://gerrit.wikimedia.org/r/1152182 (owner: Andrew Bogott) [03:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:23:53] PROBLEM - MD RAID on aqs1012 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [04:18:56] (PS1) Andrew Bogott: nova vendordata: just
live with systemd-resolved [puppet] - https://gerrit.wikimedia.org/r/1152187 [04:20:18] (PS2) Andrew Bogott: nova vendordata: just live with systemd-resolved [puppet] - https://gerrit.wikimedia.org/r/1152187 [04:20:22] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152187 (owner: Andrew Bogott) [04:23:49] (CR) Andrew Bogott: [C:+2] nova vendordata: just live with systemd-resolved [puppet] - https://gerrit.wikimedia.org/r/1152187 (owner: Andrew Bogott) [04:35:22] ops-codfw, DC-Ops: Unresponsive management for mc-misc2001.mgmt:22 - https://phabricator.wikimedia.org/T395643 (phaultfinder) NEW [05:10:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250530T0600) [06:00:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:03:44] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[06:04:17] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5728/co" [puppet] - https://gerrit.wikimedia.org/r/1152083 (owner: Giuseppe Lavagetto) [06:27:29] (CR) Muehlenhoff: aptrepo: add thirdparty/ci component to bookworm-wikimedia (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn) [06:29:27] !log uninstalling systemd-coredump (only installed on one host due to an older tests, but not needed and there's open security issues) [06:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:39] (Abandoned) Bartosz Wójtowicz: ml-services: Update STORAGE_URI for articlequality model. [deployment-charts] - https://gerrit.wikimedia.org/r/1151734 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz) [06:43:32] (PS1) Muehlenhoff: Update canary [puppet] - https://gerrit.wikimedia.org/r/1152188 [06:44:44] (PS1) Bartosz Wójtowicz: ml-services: Use old image for `readability-old` inference service. [deployment-charts] - https://gerrit.wikimedia.org/r/1152189 (https://phabricator.wikimedia.org/T393865) [06:53:41] (PS1) Elukey: admin_ng: disable PSP and enable PSS for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1152190 (https://phabricator.wikimedia.org/T369493) [06:54:09] (PS2) Elukey: admin_ng: disable PSP and enable PSS for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1152190 (https://phabricator.wikimedia.org/T369493) [06:54:35] (PS2) Muehlenhoff: Update canary [puppet] - https://gerrit.wikimedia.org/r/1152188 [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250530T0700) [07:01:26] (CR) Elukey: [C:+2] admin_ng: disable PSP and enable PSS for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1152190 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [07:04:29] (CR) Elukey: [C:+1] "Image is present on the registry, LGTM :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1152189 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz) [07:05:39] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=eqiad [07:05:57] (CR) Bartosz Wójtowicz: [C:+2] ml-services: Use old image for `readability-old` inference service. [deployment-charts] - https://gerrit.wikimedia.org/r/1152189 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz) [07:07:17] (Merged) jenkins-bot: ml-services: Use old image for `readability-old` inference service. [deployment-charts] - https://gerrit.wikimedia.org/r/1152189 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz) [07:08:22] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [07:08:38] !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:09:20] !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:23:18] (PS1) Elukey: kubernetes: disable PSP for ml-serve-eqiad [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) [07:23:42] (CR) CI reject: [V:-1] kubernetes: disable PSP for ml-serve-eqiad [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [07:26:50] (PS2) Elukey: kubernetes: disable PSP for ml-serve-eqiad [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) [07:27:02] (PS3) Elukey: kubernetes: disable PSP for ml-serve-eqiad [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) [07:36:34] (CR) Klausman: kubernetes: disable PSP for ml-serve-eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [07:43:29] (CR) Elukey: kubernetes: disable PSP for ml-serve-eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [07:43:59] (CR) Federico Ceratto: [C:+1] "Regarding the pool.py cookbook, LGTM, the change is consistent with similar changes in other cookbooks."
[cookbooks] - https://gerrit.wikimedia.org/r/1136843 (owner: Volans) [07:56:07] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5729/co" [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [07:58:54] (CR) Elukey: [V:+1] kubernetes: disable PSP for ml-serve-eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [08:07:27] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:12:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:13:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:17:27] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:18:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2187 for reimaging, see T394884', diff saved to https://phabricator.wikimedia.org/P76702 and previous config saved to /var/cache/conftool/dbconfig/20250530-081804-fceratto.json [08:18:10] T394884: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884 [08:20:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2187.codfw.wmnet with reason: Reimaging [08:21:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2040 T395647', diff saved to https://phabricator.wikimedia.org/P76703 and previous config saved to
/var/cache/conftool/dbconfig/20250530-082144-marostegui.json [08:21:49] T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647 [08:23:04] (PS1) Marostegui: es2040: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1152240 (https://phabricator.wikimedia.org/T395647) [08:23:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2040.codfw.wmnet with reason: Maintenance [08:25:59] (CR) Marostegui: [C:+2] es2040: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1152240 (https://phabricator.wikimedia.org/T395647) (owner: Marostegui) [08:35:48] (CR) Tiziano Fogli: centrallog: Add a temporary rsyslog debug config file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [08:38:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76704 and previous config saved to /var/cache/conftool/dbconfig/20250530-083836-root.json [08:53:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76705 and previous config saved to /var/cache/conftool/dbconfig/20250530-085341-root.json [09:07:41] (CR) Klausman: [C:+2] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1152194 (https://phabricator.wikimedia.org/T369493) (owner: Elukey) [09:08:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76706 and previous config saved to /var/cache/conftool/dbconfig/20250530-090847-root.json [09:09:40] (PS1) Muehlenhoff: standard_packages: Handle dnsutils/bind9-dnsutils correctly across all supported OSes [puppet] - https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) [09:11:50] (CR) CI reject: [V:-1] standard_packages: Handle dnsutils/bind9-dnsutils correctly across all supported OSes [puppet] - https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff) [09:17:48] (PS2) Muehlenhoff: standard_packages: Handle dnsutils/bind9-dnsutils correctly across all OSes [puppet] - https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) [09:23:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76707 and previous config saved to /var/cache/conftool/dbconfig/20250530-092353-root.json [09:24:09] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff) [09:25:19] FIRING: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cloudsw1-d5 (10.64.146.253) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-c8-eqiad:9804&var-bgp_group=prod_ibgp4&var-bgp_neighbor=cloudsw1-d5 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBG [09:30:19] RESOLVED: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cloudsw1-d5
(10.64.146.253) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-c8-eqiad:9804&var-bgp_group=prod_ibgp4&var-bgp_neighbor=cloudsw1-d5 - https://alerts.wikimedia.org/?q=alertname%3DCloudCore [09:39:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76708 and previous config saved to /var/cache/conftool/dbconfig/20250530-093859-root.json [09:40:03] PROBLEM - SSH on an-worker1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:41:53] RECOVERY - SSH on an-worker1067 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:47:03] PROBLEM - SSH on an-worker1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:53:28] (PS1) Federico Ceratto: db2187.yaml: disable notifications for reimage [puppet] - https://gerrit.wikimedia.org/r/1152249 (https://phabricator.wikimedia.org/T394884) [09:53:28] (CR) Federico Ceratto: [C:+1] "As discussed on IRC" [puppet] - https://gerrit.wikimedia.org/r/1152249 (https://phabricator.wikimedia.org/T394884) (owner: Federico Ceratto) [09:53:54] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10870440 (elukey) @Jgiannelos ran a diff test between staging (running on the new postgres cluster, maps-test2*) and prod, the results were very good (limited diffs at high percenti...
[09:54:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76709 and previous config saved to /var/cache/conftool/dbconfig/20250530-095405-root.json [09:57:53] (CR) Marostegui: [C:+1] db2187.yaml: disable notifications for reimage [puppet] - https://gerrit.wikimedia.org/r/1152249 (https://phabricator.wikimedia.org/T394884) (owner: Federico Ceratto) [10:00:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:01:53] RECOVERY - SSH on an-worker1067 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:03:02] (CR) MVernon: [C:+1] "Seems like a sensible approach, thanks." [puppet] - https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff) [10:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76710 and previous config saved to /var/cache/conftool/dbconfig/20250530-100911-root.json [10:14:17] SRE, SRE-Unowned, Maps: New apus account for Tegola - https://phabricator.wikimedia.org/T395659 (elukey) NEW [10:16:53] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [10:19:32] (CR) Elukey: [C:+1] standard_packages: Handle dnsutils/bind9-dnsutils correctly across all OSes [puppet] - https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: Muehlenhoff) [10:22:04] (CR) Federico Ceratto: [C:+2] db2187.yaml: disable
notifications for reimage [puppet] - https://gerrit.wikimedia.org/r/1152249 (https://phabricator.wikimedia.org/T394884) (owner: Federico Ceratto) [10:24:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76711 and previous config saved to /var/cache/conftool/dbconfig/20250530-102416-root.json [10:28:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [10:28:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T395241)', diff saved to https://phabricator.wikimedia.org/P76712 and previous config saved to /var/cache/conftool/dbconfig/20250530-102830-fceratto.json [10:29:59] (PS1) Dr0ptp4kt: WIP DNM: Support edge uniques A/B tests on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) [10:39:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T395241)', diff saved to https://phabricator.wikimedia.org/P76713 and previous config saved to /var/cache/conftool/dbconfig/20250530-103903-fceratto.json [10:54:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P76714 and previous config saved to /var/cache/conftool/dbconfig/20250530-105410-fceratto.json [10:57:29] !log T395592 Ran mwscript-k8s --comment="T395592" --follow -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Yusuftahaluleci' 'Yusuf_Taha_Lüleci' [10:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:34] T395592: Unblock stuck global rename of Yusuf_Taha_Lüleci - https://phabricator.wikimedia.org/T395592 [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250530T0700) [11:00:04] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250530T1100). nyaa~ [11:09:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P76715 and previous config saved to /var/cache/conftool/dbconfig/20250530-110917-fceratto.json [11:24:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T395241)', diff saved to https://phabricator.wikimedia.org/P76716 and previous config saved to /var/cache/conftool/dbconfig/20250530-112423-fceratto.json [11:24:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:24:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T395241)', diff saved to https://phabricator.wikimedia.org/P76717 and previous config saved to /var/cache/conftool/dbconfig/20250530-112449-fceratto.json [11:34:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T395241)', diff saved to https://phabricator.wikimedia.org/P76718 and previous config saved to /var/cache/conftool/dbconfig/20250530-113414-fceratto.json [11:40:09] fceratto@cumin1002 reimage (PID 1341034) is awaiting input [11:41:11] (PS1) Muehlenhoff: Remove access for aborrero [puppet] - https://gerrit.wikimedia.org/r/1152257 [11:41:49] !log fceratto@cumin1002 START - Cookbook sre.hosts.reimage for host db2187.codfw.wmnet with OS bookworm [11:49:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P76720 and previous config saved to /var/cache/conftool/dbconfig/20250530-114921-fceratto.json [11:49:32] (CR) Slyngshede: [C:+1] "LGTM" [puppet] -
https://gerrit.wikimedia.org/r/1152257 (owner: Muehlenhoff) [11:51:19] (CR) Ilias Sarantopoulos: [C:-1] "Thanks for working on this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T391964) (owner: Gkyziridis) [11:51:34] (CR) Muehlenhoff: [C:+2] Remove access for aborrero [puppet] - https://gerrit.wikimedia.org/r/1152257 (owner: Muehlenhoff) [12:01:10] !log fceratto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [12:03:44] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [12:04:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P76721 and previous config saved to /var/cache/conftool/dbconfig/20250530-120427-fceratto.json [12:19:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T395241)', diff saved to https://phabricator.wikimedia.org/P76723 and previous config saved to /var/cache/conftool/dbconfig/20250530-121934-fceratto.json [12:19:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [12:20:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T395241)', diff saved to https://phabricator.wikimedia.org/P76724 and previous config saved to /var/cache/conftool/dbconfig/20250530-122001-fceratto.json [12:21:33] !log removing superfluous 'mode auto' command on codfw dc switches T394530 [12:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:38] T394530: Homer: redefine IBGP definitions to support both Unicast & EVPN clusters - https://phabricator.wikimedia.org/T394530 [12:26:29] (PS3) Kamila Součková: aux-k8s-services/*: use the correct helm version in each cluster
[deployment-charts] - https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) [12:26:30] (CR) Kamila Součková: "I think I may have wanted to first fix the possible race (which hasn't recurred recently...) before merging it? I'm honestly not sure. But" [deployment-charts] - https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková) [12:26:53] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2187.codfw.wmnet with OS bookworm [12:27:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T395241)', diff saved to https://phabricator.wikimedia.org/P76725 and previous config saved to /var/cache/conftool/dbconfig/20250530-122701-fceratto.json [12:29:35] (PS2) Gkyziridis: ores-extension: enable ores extension for rrla without UI for simplewiki and trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [12:29:48] (PS1) Cathal Mooney: EVPN_BGP: add peer-as to conf to match unicast and remove auto on bfd [homer/public] - https://gerrit.wikimedia.org/r/1152258 (https://phabricator.wikimedia.org/T394530) [12:32:20] (PS3) Gkyziridis: ores-extension: enable ores extension for rrla without UI for simplewiki and trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [12:33:50] (CR) Cathal Mooney: [C:+2] EVPN_BGP: add peer-as to conf to match unicast and remove auto on bfd [homer/public] - https://gerrit.wikimedia.org/r/1152258 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney) [12:34:21] (Merged) jenkins-bot: EVPN_BGP: add peer-as to conf to match unicast and remove auto on bfd [homer/public] - https://gerrit.wikimedia.org/r/1152258 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney) [12:37:12] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout
(exit_code=0) Logging Arturo Borrero Gonzalez out of all services on: 927 hosts [12:37:18] (CR) Kamila Součková: [C:+1] wikikube: decommission wikikube-worker102[6-8].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1151759 (https://phabricator.wikimedia.org/T383227) (owner: Jasmine) [12:38:23] (CR) Kamila Součková: [C:+1] wikikube: decommission wikikube-worker103[23].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1151808 (https://phabricator.wikimedia.org/T383227) (owner: Jasmine) [12:41:47] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Arturo Borrero Gonzalez out of all services on: 1437 hosts [12:42:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P76727 and previous config saved to /var/cache/conftool/dbconfig/20250530-124209-fceratto.json [12:53:31] 12 [12:53:33] err :) [12:57:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P76728 and previous config saved to /var/cache/conftool/dbconfig/20250530-125717-fceratto.json [13:03:10] (PS1) Federico Ceratto: db2187.yaml: Enable notifications after reimage [puppet] - https://gerrit.wikimedia.org/r/1152262 (https://phabricator.wikimedia.org/T394884) [13:03:10] (CR) Federico Ceratto: [C:+1] "As discussed on IRC" [puppet] - https://gerrit.wikimedia.org/r/1152262 (https://phabricator.wikimedia.org/T394884) (owner: Federico Ceratto) [13:12:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T395241)', diff saved to https://phabricator.wikimedia.org/P76729 and previous config saved to /var/cache/conftool/dbconfig/20250530-131223-fceratto.json [13:12:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [13:12:52] !log fceratto@cumin1002
dbctl commit (dc=all): 'Depooling db2204 (T395241)', diff saved to https://phabricator.wikimedia.org/P76730 and previous config saved to /var/cache/conftool/dbconfig/20250530-131251-fceratto.json [13:20:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T395241)', diff saved to https://phabricator.wikimedia.org/P76731 and previous config saved to /var/cache/conftool/dbconfig/20250530-132006-fceratto.json [13:22:46] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [13:28:13] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [13:31:26] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [13:35:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P76732 and previous config saved to /var/cache/conftool/dbconfig/20250530-133514-fceratto.json [13:36:15] (PS1) 01tonythomas: Install newsletter extension on enwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1152264 (https://phabricator.wikimedia.org/T394022) [13:37:24] (PS2) 01tonythomas: Install newsletter extension on enwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1152264 (https://phabricator.wikimedia.org/T394022) [13:37:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [13:43:02] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host testvm2006.codfw.wmnet [13:44:29] (CR) 01tonythomas: [C:-1] "Waiting for a final go-ahead on the Phabricator task."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1152264 (https://phabricator.wikimedia.org/T394022) (owner: 01tonythomas) [13:46:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2006.codfw.wmnet [13:47:38] (PS2) Cathal Mooney: New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) [13:50:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P76733 and previous config saved to /var/cache/conftool/dbconfig/20250530-135020-fceratto.json [13:51:10] (PS4) Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [13:53:04] (PS5) Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [13:54:16] (CR) Ilias Sarantopoulos: [C:+1] "LGTM!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) (owner: Gkyziridis) [13:54:36] (PS1) Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [13:55:03] SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10870994 (MoritzMuehlenhoff) [13:55:18] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host testvm2007.codfw.wmnet [13:58:02] (CR) CI reject: [V:-1] BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney) [13:58:52] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2007.codfw.wmnet [14:03:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:03:58] (PS2) Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [14:04:32] (CR) CI reject: [V:-1] BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney) [14:04:40] (PS1) Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config (WIP) [puppet] - https://gerrit.wikimedia.org/r/1152274 [14:04:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:04:59] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host
testvm2008.wikimedia.org [14:05:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T395241)', diff saved to https://phabricator.wikimedia.org/P76734 and previous config saved to /var/cache/conftool/dbconfig/20250530-140527-fceratto.json [14:05:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: Maintenance [14:05:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T395241)', diff saved to https://phabricator.wikimedia.org/P76735 and previous config saved to /var/cache/conftool/dbconfig/20250530-140554-fceratto.json [14:07:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2008.wikimedia.org [14:09:30] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [14:13:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T395241)', diff saved to https://phabricator.wikimedia.org/P76737 and previous config saved to /var/cache/conftool/dbconfig/20250530-141314-fceratto.json [14:13:21] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10871061 (10Fabfur) >>! In T392851#10859545, @Jhancock.wm wrote: > @Fabfur I'm planning on racking them tomorrow. I have a few servers ahead of it in the imaging queue b... 
[14:15:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [14:15:39] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest2001.codfw.wmnet [14:17:19] (03CR) 10Fabfur: "sorry was in PTO and missed this comment, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1151277 (https://phabricator.wikimedia.org/T395358) (owner: 10Fabfur) [14:19:36] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1151644 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [14:20:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2001.codfw.wmnet [14:21:45] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest2004.codfw.wmnet [14:24:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 65992616 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:25:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 16864 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:27:06] (03CR) 10Hashar: [C:03+1] "From the compiler: we can see:" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [14:28:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2004.codfw.wmnet [14:28:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P76738 and previous config saved to /var/cache/conftool/dbconfig/20250530-142821-fceratto.json [14:32:36] (03PS3) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) 
[14:40:28] ACKNOWLEDGEMENT - MD RAID on aqs1012 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T395685 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:40:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T395685 (10ops-monitoring-bot) 03NEW [14:42:45] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T395685#10871133 (10tappof) This task has not been automatically generated but has been manually created with the command: ` root@alert1002:/var/log/icinga# /usr/lib/nagios/plugins/eventhandlers/raid_handler -d -s CRIT... [14:42:49] (03CR) 10Ladsgroup: [C:03+1] "it's all green in https://icinga.wikimedia.org/icinga/" [puppet] - 10https://gerrit.wikimedia.org/r/1152262 (https://phabricator.wikimedia.org/T394884) (owner: 10Federico Ceratto) [14:43:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P76739 and previous config saved to /var/cache/conftool/dbconfig/20250530-144329-fceratto.json [14:44:24] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1163-1165].eqiad.wmnet with reason: hard drive replacement in progress [14:44:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10871139 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6fc10fc1-13bc-409c-b7c3-cd012c8cb3f6) set b... 
[14:45:12] (03PS1) 10Ahmon Dancy: hieradata/cloud.yaml: Use $profile::resolving::nameserver_ips for profile::gitlab::runner::buildkitd_nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) [14:45:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10871154 (10Stevemunene) We have zero under replicated blocks in the cluster {F60889551} and the hosts are listed as dec... [14:45:34] (03CR) 10CI reject: [V:04-1] hieradata/cloud.yaml: Use $profile::resolving::nameserver_ips for profile::gitlab::runner::buildkitd_nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy) [14:45:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10871159 (10Stevemunene) [14:46:42] (03PS1) 10Andrew Bogott: Revert a bunch of vendordata changes [puppet] - 10https://gerrit.wikimedia.org/r/1152281 [14:47:09] (03CR) 10CI reject: [V:04-1] Revert a bunch of vendordata changes [puppet] - 10https://gerrit.wikimedia.org/r/1152281 (owner: 10Andrew Bogott) [14:47:16] (03PS2) 10Ahmon Dancy: hieradata/cloud.yaml: Update profile::gitlab::runner::buildkitd_nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) [14:48:33] (03PS2) 10Andrew Bogott: Revert a bunch of vendordata changes [puppet] - 10https://gerrit.wikimedia.org/r/1152281 [14:48:34] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy) [14:49:00] (03CR) 10CI reject: [V:04-1] Revert a bunch of vendordata changes [puppet] - 10https://gerrit.wikimedia.org/r/1152281 (owner: 10Andrew Bogott) 
[14:49:39] (03PS3) 10Andrew Bogott: Revert a bunch of vendordata changes [puppet] - 10https://gerrit.wikimedia.org/r/1152281 [14:49:57] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10871195 (10dancy) The puppet failure has chan... [14:50:13] (03CR) 10Andrew Bogott: [C:03+2] Revert a bunch of vendordata changes [puppet] - 10https://gerrit.wikimedia.org/r/1152281 (owner: 10Andrew Bogott) [14:51:44] (03PS4) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [14:57:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:57:23] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:58:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T395241)', diff saved to https://phabricator.wikimedia.org/P76740 and previous config saved to /var/cache/conftool/dbconfig/20250530-145835-fceratto.json [14:58:42] (03CR) 10Kgraessle: [C:03+1] ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [14:58:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2226.codfw.wmnet with reason: Maintenance [14:59:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T395241)', diff saved to https://phabricator.wikimedia.org/P76741 and previous config saved to /var/cache/conftool/dbconfig/20250530-145901-fceratto.json [15:01:24] (03CR) 10Federico Ceratto: [C:03+2] db2187.yaml: Enable notifications after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1152262 (https://phabricator.wikimedia.org/T394884) (owner: 10Federico Ceratto) [15:05:57] (03PS4) 10Dzahn: aptrepo: add thirdparty/ci component to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) [15:05:57] (03CR) 10Dzahn: aptrepo: add thirdparty/ci component to bookworm-wikimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [15:06:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T395241)', diff saved to https://phabricator.wikimedia.org/P76742 and previous config saved to /var/cache/conftool/dbconfig/20250530-150603-fceratto.json [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:21] (03PS2) 10Giuseppe Lavagetto: cache::haproxy: remove unused variables from configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152083 [15:11:22] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: remove post_acl_actions and sticktable variables [puppet] - 10https://gerrit.wikimedia.org/r/1152288 [15:11:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10871312 (10BCornwall) [15:12:34] (03CR) 10CI reject: [V:04-1] cache::haproxy: remove post_acl_actions and sticktable variables [puppet] - 10https://gerrit.wikimedia.org/r/1152288 (owner: 10Giuseppe Lavagetto) [15:17:39] (03PS3) 10Ahmon Dancy: profile::gitlab::runner: Resolve namservers to IPs [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) [15:17:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P76743 and previous config saved to /var/cache/conftool/dbconfig/20250530-152111-fceratto.json [15:21:52] (03CR) 10Ahmon Dancy: "Tested via cherry-pick into gitlab-runners-puppetserver-01.gitlab-runners.eqiad1.wikimedia.cloud:/srv/git/operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy) [15:23:23] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10871358 (10bking) 05Open→03In 
progress a:... [15:30:58] PROBLEM - Hadoop NodeManager on an-worker1201 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:36:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P76744 and previous config saved to /var/cache/conftool/dbconfig/20250530-153618-fceratto.json [15:40:53] (03CR) 10Scott French: [C:03+1] "Looked good at patchset 3, still looks good now :)" [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [15:41:52] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:46:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10871431 (10BCornwall) [15:48:52] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:49:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2187 gradually with 4 steps - Pooling in after reimage [15:50:58] RECOVERY - Hadoop NodeManager on an-worker1201 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:51:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76746 and previous config saved to /var/cache/conftool/dbconfig/20250530-155125-fceratto.json [15:51:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10871452 (10BCornwall) [15:51:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: Maintenance [15:51:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T395241)', diff saved to https://phabricator.wikimedia.org/P76747 and previous config saved to /var/cache/conftool/dbconfig/20250530-155152-fceratto.json [15:52:14] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10871457 (10bking) I'm taking a look at this n... [15:57:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10871463 (10BCornwall) [15:59:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T395241)', diff saved to https://phabricator.wikimedia.org/P76748 and previous config saved to /var/cache/conftool/dbconfig/20250530-155900-fceratto.json [16:01:39] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10871475 (10bking) Interestingly, I can reprod... 
[16:14:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P76750 and previous config saved to /var/cache/conftool/dbconfig/20250530-161408-fceratto.json [16:18:30] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:18:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:21:19] (03PS1) 10Bking: cirrussearch: remove elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1152297 (https://phabricator.wikimedia.org/T393924) [16:22:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152297 (https://phabricator.wikimedia.org/T393924) (owner: 10Bking) [16:23:13] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: remove elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1152297 (https://phabricator.wikimedia.org/T393924) (owner: 10Bking) [16:24:14] (03CR) 10Bking: [C:03+2] cirrussearch: remove elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1152297 (https://phabricator.wikimedia.org/T393924) (owner: 10Bking) [16:28:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:28:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:29:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P76752 and previous config saved to /var/cache/conftool/dbconfig/20250530-162915-fceratto.json [16:35:08] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2187 gradually with 4 steps - Pooling in after reimage [16:39:20] (03CR) 10Alexandros Kosiaris: "Andrew is apparently on PTO. @gmodena@wikimedia.org any chance there is anyone in your team able to review this? 
All I need is a yay/nay o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [16:39:30] (03PS2) 10Raymond Ndibe: toolforge:prometheus:: add components-api scrape endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) [16:39:50] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10871623 (10bking) 05In progress→03Reso... [16:40:32] (03CR) 10Raymond Ndibe: "This has been resolved @taavi@wikimedia.org , can you look again?" [puppet] - 10https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: 10Raymond Ndibe) [16:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:44:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T395241)', diff saved to https://phabricator.wikimedia.org/P76754 and previous config saved to /var/cache/conftool/dbconfig/20250530-164423-fceratto.json [16:45:40] (03PS1) 10Andrew Bogott: Update cloud-vps VMs to version 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1152298 [16:46:35] (03CR) 10Andrew Bogott: [C:03+2] Update cloud-vps VMs to version 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1152298 (owner: 10Andrew Bogott) [16:53:39] FIRING: CoreBGPDown: Core BGP session down between ssw1-e1-codfw and cr1-codfw (2620:0:860:139::1a) - group core - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-e1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=cr1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:54:44] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10871657 (10dancy) Thanks @bking! There is... [16:58:39] RESOLVED: CoreBGPDown: Core BGP session down between ssw1-e1-codfw and cr1-codfw (2620:0:860:139::1a) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-e1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=cr1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:59:54] (03PS1) 10AOkoth: site: apply doc role to doc2003 [puppet] - 10https://gerrit.wikimedia.org/r/1152300 (https://phabricator.wikimedia.org/T392130) [16:59:56] (03PS1) 10Cathal Mooney: Add BGP session from cr1-codfw to ssw1-e1-codfw and remove nokia [homer/public] - 10https://gerrit.wikimedia.org/r/1152301 (https://phabricator.wikimedia.org/T394021) [17:01:57] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10871698 (10cmooney) @Jhancock.wm as discussed on irc the link from ssw1-e1-codfw is working fine, however the link from ssw1-f1-codfw to cr2-codfw is down, showing no light either side. Can... 
[17:02:01] (03CR) 10Cathal Mooney: [C:03+2] Add BGP session from cr1-codfw to ssw1-e1-codfw and remove nokia [homer/public] - 10https://gerrit.wikimedia.org/r/1152301 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [17:02:31] (03Merged) 10jenkins-bot: Add BGP session from cr1-codfw to ssw1-e1-codfw and remove nokia [homer/public] - 10https://gerrit.wikimedia.org/r/1152301 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [17:02:42] (03PS5) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [17:03:45] (03PS6) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [17:03:47] (03PS1) 10AOkoth: trafficserver: point os-reports to k8s record [puppet] - 10https://gerrit.wikimedia.org/r/1152305 (https://phabricator.wikimedia.org/T350794) [17:06:44] (03PS1) 10Jdlrobson: Temporarily disable access for Jon [puppet] - 10https://gerrit.wikimedia.org/r/1152307 [17:07:00] (03CR) 10Jdlrobson: [C:04-1] "this should be merged on 10th June." [puppet] - 10https://gerrit.wikimedia.org/r/1152307 (owner: 10Jdlrobson) [17:07:04] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:07:08] (03PS1) 10AOkoth: add codfw to os-reports in service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1152308 (https://phabricator.wikimedia.org/T350794) [17:07:16] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:17] ^^ Arelion transport from eqsin to codfw down again. Traffic is ok, taking path via NTT to ulsfo and onwards to the core sites. 
[17:18:27] (03PS7) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [17:19:23] 10ops-esams, 06SRE, 06DC-Ops: Inbound errors on interface cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://phabricator.wikimedia.org/T393213#10871752 (10RobH) #netops, What are the chances this is somethign on our end going bad when we haven't touched anything? I just am not sure of the... [17:19:38] 10ops-esams, 06SRE, 06DC-Ops: Inbound errors on interface cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://phabricator.wikimedia.org/T393213#10871754 (10RobH) @cmooney: Thoughts? >>! In T393213#10871742, @RobH wrote: > #netops, > > What are the chances this is somethign on our end going... [17:20:38] 10ops-esams, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389874#10871759 (10RobH) 05Open→03Resolved a:03RobH dupe of T393213, tracking issue there [17:21:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:22:44] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:24:10] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10871773 (10Arnoldokoth) Thank you @KFrancis [17:27:09] 10ops-esams, 06SRE, 06DC-Ops: Inbound errors on interface cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://phabricator.wikimedia.org/T393213#10871791 
(10cmooney) >>! In T393213#10871742, @RobH wrote: > What are the chances this is somethign on our end going bad when we haven't touched any... [17:27:26] (03PS1) 10BCornwall: varnish: Replace X-Include-PV with include_pv var [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) [17:27:41] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:27:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:31:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc1 (T395241)', diff saved to https://phabricator.wikimedia.org/P76755 and previous config saved to /var/cache/conftool/dbconfig/20250530-173132-root.json [17:32:15] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for link from cr1-codfw to ssw1-e1-codfw - cmooney@cumin1002" [17:32:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for link from cr1-codfw to ssw1-e1-codfw - cmooney@cumin1002" [17:32:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:32:49] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1011.eqiad.wmnet with 
reason: Maintenance [17:33:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:33:43] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:35:48] PROBLEM - MariaDB Replica IO: pc1 on pc2011 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1011.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1011.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:37:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:37:57] sigh, the downtime should have happened on the other primary too [17:39:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2011.codfw.wmnet with reason: Maintenance [17:45:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc1 (T395241)', diff saved to https://phabricator.wikimedia.org/P76756 and previous config saved to /var/cache/conftool/dbconfig/20250530-174510-root.json [17:45:50] RECOVERY - MariaDB Replica IO: pc1 on pc2011 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:46:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc2 (T395241)', diff saved to https://phabricator.wikimedia.org/P76757 and previous config saved to /var/cache/conftool/dbconfig/20250530-174652-root.json [17:48:04] (03CR) 10Dzahn: "I recommend just applying the role first. Add it to all_hosts afterwards. 
This should avoid a ticket created for failed rsync." [puppet] - 10https://gerrit.wikimedia.org/r/1152300 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:48:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1012.eqiad.wmnet with reason: Maintenance [17:48:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2012.codfw.wmnet with reason: Maintenance [17:51:03] (03PS1) 10Ebrahim: Remove wmgMinervaNightModeExcludeNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152327 (https://phabricator.wikimedia.org/T393977) [17:53:40] (03PS2) 10Dzahn: Add tj to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) (owner: 10Jasmine) [17:53:59] (03CR) 10CI reject: [V:04-1] Remove wmgMinervaNightModeExcludeNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152327 (https://phabricator.wikimedia.org/T393977) (owner: 10Ebrahim) [17:54:03] (03PS2) 10Ebrahim: Remove wmgMinervaNightModeExcludeNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152327 (https://phabricator.wikimedia.org/T393977) [17:59:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:59:16] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:59:17] (03PS3) 10BCornwall: varnish: Replace X-Include-PV with include_pv var [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) [17:59:33] (03CR) 10BCornwall: [V:03+1] "`0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:59:33] (03PS1) 10Dzahn: backup: add doc1004 to job_monitoring_ignorelist 
[puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) [18:00:25] (03CR) 10Jdlrobson: "Seems like a risky change for little gain. I'd rather we focused on removing the code upstream personally, particularly since other projec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152327 (https://phabricator.wikimedia.org/T393977) (owner: 10Ebrahim) [18:00:58] (03CR) 10Dzahn: "doc2003 should probably also be added before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152300 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [18:01:26] (03CR) 10Dzahn: "feel free to amend and merge anytime" [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [18:01:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc2 (T395241)', diff saved to https://phabricator.wikimedia.org/P76758 and previous config saved to /var/cache/conftool/dbconfig/20250530-180131-root.json [18:01:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc3 (T395241)', diff saved to https://phabricator.wikimedia.org/P76759 and previous config saved to /var/cache/conftool/dbconfig/20250530-180140-root.json [18:02:10] FIRING: BFDdown: BFD session down between cr3-eqsin and 103.102.166.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:02:56] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1013.eqiad.wmnet with reason: Maintenance [18:03:12] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2013.codfw.wmnet with reason: Maintenance [18:07:10] RESOLVED: BFDdown: BFD session down between cr3-eqsin and 
103.102.166.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:07:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:08:03] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:12:13] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@cd72f3e]: bump section topics to v1.2.0 and SEAL to v0.6.0 [18:13:20] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@cd72f3e]: bump section topics to v1.2.0 and SEAL to v0.6.0 (duration: 01m 33s) [18:16:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc3 (T395241)', diff saved to https://phabricator.wikimedia.org/P76760 and previous config saved to /var/cache/conftool/dbconfig/20250530-181616-root.json [18:17:07] (03PS7) 10BCornwall: varnish: Replace X-Page-ID with variable [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) [18:17:08] (03PS4) 10BCornwall: varnish: Replace X-Include-PV with include_pv var [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) [18:18:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc4 (T395241)', diff saved to https://phabricator.wikimedia.org/P76761 and previous config saved to /var/cache/conftool/dbconfig/20250530-181855-root.json [18:20:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1014.eqiad.wmnet with reason: Maintenance [18:20:18] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2014.codfw.wmnet with reason: Maintenance [18:20:54] (03PS2) 10Dr0ptp4kt: WIP DNM: Support edge uniques A/B tests on beta cluster 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) [18:21:20] (03CR) 10Dr0ptp4kt: [C:03+1] WIP DNM: Support edge uniques A/B tests on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: 10Dr0ptp4kt) [18:22:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10871963 (10Kappakayala) yes, I approve. Please provide access. [18:33:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc4 (T395241)', diff saved to https://phabricator.wikimedia.org/P76763 and previous config saved to /var/cache/conftool/dbconfig/20250530-183310-root.json [18:33:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc5 (T395241)', diff saved to https://phabricator.wikimedia.org/P76764 and previous config saved to /var/cache/conftool/dbconfig/20250530-183319-root.json [18:34:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1015.eqiad.wmnet with reason: Maintenance [18:34:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2015.codfw.wmnet with reason: Maintenance [18:45:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc5 (T395241)', diff saved to https://phabricator.wikimedia.org/P76765 and previous config saved to /var/cache/conftool/dbconfig/20250530-184552-root.json [18:46:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc6 (T395241)', diff saved to https://phabricator.wikimedia.org/P76766 and previous config saved to /var/cache/conftool/dbconfig/20250530-184605-root.json [18:47:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1016.eqiad.wmnet with reason: Maintenance [18:47:38] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2016.codfw.wmnet with reason: Maintenance [18:48:22] (03PS3) 10Dr0ptp4kt: Support edge uniques A/B tests on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) [18:52:03] (03PS1) 10Eevans: cassandra: reuse preseed for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) [18:52:32] (03CR) 10CI reject: [V:04-1] cassandra: reuse preseed for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [18:56:16] (03CR) 10AOkoth: [C:03+1] backup: add doc1004 to job_monitoring_ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [18:57:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:57:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:58:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc6 (T395241)', diff saved to https://phabricator.wikimedia.org/P76767 and previous config saved to /var/cache/conftool/dbconfig/20250530-185823-root.json [18:58:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc7 (T395241)', diff saved to https://phabricator.wikimedia.org/P76768 and previous config saved to /var/cache/conftool/dbconfig/20250530-185832-root.json [18:59:49] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: Maintenance [19:00:06] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2017.codfw.wmnet with reason: Maintenance [19:00:11] 
jhancock@cumin2002 provision (PID 1198893) is awaiting input [19:00:14] (03PS1) 10AOkoth: admin: add oblivian to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1152338 (https://phabricator.wikimedia.org/T395413) [19:00:30] jhancock@cumin2002 provision (PID 1199271) is awaiting input [19:02:53] (03CR) 10Dzahn: [C:03+2] backup: add doc1004 to job_monitoring_ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [19:04:01] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1152338 (https://phabricator.wikimedia.org/T395413) (owner: 10AOkoth) [19:05:14] (03CR) 10AOkoth: [C:03+2] admin: add oblivian to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1152338 (https://phabricator.wikimedia.org/T395413) (owner: 10AOkoth) [19:10:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc7 (T395241)', diff saved to https://phabricator.wikimedia.org/P76769 and previous config saved to /var/cache/conftool/dbconfig/20250530-191057-root.json [19:11:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc8 (T395241)', diff saved to https://phabricator.wikimedia.org/P76770 and previous config saved to /var/cache/conftool/dbconfig/20250530-191105-root.json [19:11:36] (03CR) 10Dzahn: [C:03+1] "noticed this still needs the Kerberos part https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Kerberos/Administration#Create_a_prin" [puppet] - 10https://gerrit.wikimedia.org/r/1152338 (https://phabricator.wikimedia.org/T395413) (owner: 10AOkoth) [19:12:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1018.eqiad.wmnet with reason: Maintenance [19:12:38] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2018.codfw.wmnet with reason: Maintenance [19:13:47] (03PS1) 10AOkoth: admin: add krb present for oblivian [puppet] - 
10https://gerrit.wikimedia.org/r/1152341 (https://phabricator.wikimedia.org/T395413) [19:14:18] (03CR) 10Dzahn: [C:03+1] admin: add krb present for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1152341 (https://phabricator.wikimedia.org/T395413) (owner: 10AOkoth) [19:14:56] (03CR) 10AOkoth: [C:03+2] admin: add krb present for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1152341 (https://phabricator.wikimedia.org/T395413) (owner: 10AOkoth) [19:15:19] (03CR) 10Dzahn: ":) great! just did not want to merge this on a Friday and walk away. let's get it started next week, k?" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [19:16:57] (03PS2) 10Eevans: cassandra: reuse preseed for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) [19:17:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10872262 (10Arnoldokoth) [19:20:03] jhancock@cumin2002 provision (PID 1198893) is awaiting input [19:23:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling pc8 (T395241)', diff saved to https://phabricator.wikimedia.org/P76771 and previous config saved to /var/cache/conftool/dbconfig/20250530-192320-root.json [19:23:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling ms1 (T395241)', diff saved to https://phabricator.wikimedia.org/P76772 and previous config saved to /var/cache/conftool/dbconfig/20250530-192329-root.json [19:24:46] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1152.eqiad.wmnet with reason: Maintenance [19:25:03] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2142.codfw.wmnet with reason: Maintenance [19:26:35] jhancock@cumin2002 provision (PID 1199271) is awaiting 
input [19:27:45] (03CR) 10Clare Ming: [C:03+1] Support edge uniques A/B tests on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: 10Dr0ptp4kt) [19:29:42] (03CR) 10Majavah: [C:04-2] "per https://phabricator.wikimedia.org/T394022#10872301." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152264 (https://phabricator.wikimedia.org/T394022) (owner: 1001tonythomas) [19:33:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:33:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:34:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2047.codfw.wmnet with OS bookworm [19:34:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10872324 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2047.codfw.wmnet with OS bookworm [19:34:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2048.codfw.wmnet with OS bookworm [19:35:09] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10872336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2048.codfw.wmnet with OS bookworm [19:35:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling ms1 (T395241)', diff saved to https://phabricator.wikimedia.org/P76773 and previous config saved to /var/cache/conftool/dbconfig/20250530-193537-root.json [19:39:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling ms2 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76774 and previous config saved to /var/cache/conftool/dbconfig/20250530-193951-root.json [19:41:08] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1151.eqiad.wmnet with reason: Maintenance [19:41:24] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2144.codfw.wmnet with reason: Maintenance [19:45:56] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts relforge[1003-1004].eqiad.wmnet [19:46:03] (03CR) 10Dzahn: "after looking at code more.. I partially take that back. This should both allow doc2003 to connect on the $active_host AND also make doc20" [puppet] - 10https://gerrit.wikimedia.org/r/1152300 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [19:47:34] (03CR) 10Bking: [C:03+2] cirrus streaming updater: decom relforge100[3,4] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147893 (https://phabricator.wikimedia.org/T390565) (owner: 10Ryan Kemper) [19:47:40] (03PS1) 10Bvibber: Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) [19:49:04] bking@cumin2002 decommission (PID 1226327) is awaiting input [19:50:58] thcipriani: bvibber thinks that we can stop the issues with T395368 with this revert -- https://gerrit.wikimedia.org/r/c/mediawiki/extensions/JsonConfig/+/1152346 -- We have confirmed 500 errors on enwiki in that bug. Can we get a :thumbsup: on deploying the revert? 
[19:50:58] T395368: PHP Warning: Attempt to read property "fields" on null - https://phabricator.wikimedia.org/T395368 [19:51:08] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:51:49] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:52:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2047.codfw.wmnet with reason: host reimage [19:52:44] * thcipriani looking [19:52:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2048.codfw.wmnet with reason: host reimage [19:54:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling ms2 (T395241)', diff saved to https://phabricator.wikimedia.org/P76775 and previous config saved to /var/cache/conftool/dbconfig/20250530-195427-root.json [19:54:28] bvibber: bd808 reverting sounds reasonable if it's affecting users on enwiki (modulo test failures on the revert) [19:54:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling ms3 (T395241)', diff saved to https://phabricator.wikimedia.org/P76776 and previous config saved to /var/cache/conftool/dbconfig/20250530-195436-root.json [19:54:47] whee [19:55:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2047.codfw.wmnet with reason: host reimage [19:55:41] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1153.eqiad.wmnet with reason: Maintenance [19:55:58] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2143.codfw.wmnet with reason: Maintenance [19:56:07] thcipriani: thanks. 
[19:57:37] (03Abandoned) 10Ebrahim: Remove wmgMinervaNightModeExcludeNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152327 (https://phabricator.wikimedia.org/T393977) (owner: 10Ebrahim) [19:57:53] (03CR) 10Ebrahim: "Understandable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152327 (https://phabricator.wikimedia.org/T393977) (owner: 10Ebrahim) [19:58:01] bvibber: sword fight in the hallway while we wait for Jerkins and zuul? :) [19:58:10] :D [19:58:45] * bd808 misses nerf wars at 149NM [19:58:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2048.codfw.wmnet with reason: host reimage [19:59:00] (03PS1) 10Bking: Revert "cirrus streaming updater: decom relforge100[3,4]" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152347 [19:59:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:59:26] (03CR) 10Bking: [V:03+2 C:03+2] Revert "cirrus streaming updater: decom relforge100[3,4]" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152347 (owner: 10Bking) [19:59:50] (03CR) 10CI reject: [V:04-1] Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) (owner: 10Bvibber) [20:00:14] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:00:28] hm [20:00:36] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:00:42] sigh [20:01:34] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) (owner: 10Bvibber) [20:02:01] good ol browser tests :D [20:03:04] https://www.irccloud.com/pastebin/pT6A2OB0/ [20:03:43] the read only errors of main stash should be gone to zero now [20:03:51] woohoo [20:04:02] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts relforge[1003-1004].eqiad.wmnet [20:04:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:07:32] bvibber: that failure is apparently T395684 [20:07:32] T395684: CI error: mediawiki.base/track trackError: unexpected "{\"exception\":{},\"source\":\"resolve\"}" - https://phabricator.wikimedia.org/T395684 [20:08:05] sigh [20:08:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling ms3 (T395241)', diff saved to https://phabricator.wikimedia.org/P76777 and previous config saved to /var/cache/conftool/dbconfig/20250530-200835-root.json [20:09:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:10:08] bvibber: I guess we can try backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1152259. 
That's what folks think fixed it in trunk [20:10:54] (03PS1) 10BryanDavis: qunit: More readable assert.step() call in mediawiki.base/track test [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152348 (https://phabricator.wikimedia.org/T395684) [20:12:16] (03PS1) 10BryanDavis: ext.wikimediaEvents: Soft-depend on MetricsPlatform [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) [20:12:49] It helps to cherry-pick the patch you meant to and not the other one :/ [20:12:53] hehe [20:13:27] (03PS2) 10Krinkle: Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) (owner: 10Bvibber) [20:13:43] (03CR) 10Krinkle: "Riding https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1152349 on CI to confirm that that fixes the failure." [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) (owner: 10Bvibber) [20:14:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:15:02] (03CR) 10Krinkle: "`" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:15:05] (03CR) 10Krinkle: "recheck" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:17:14] (03CR) 10Krinkle: "T282893" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:17:16] jhancock@cumin2002 reimage (PID 1220767) is awaiting input [20:19:04] 
that is well more than enough cascading CI errors Jerkins. Play nice or we will send you home. [20:19:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2047.codfw.wmnet with OS bookworm [20:19:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:22] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10872455 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2047.codfw.wmnet with OS bookworm completed: - es2047 (**PASS**) - Re... [20:19:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2048.codfw.wmnet with OS bookworm [20:19:50] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10872456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2048.codfw.wmnet with OS bookworm completed: - es2048 (**PASS**) - Re... 
[20:19:58] (03CR) 10CI reject: [V:04-1] ext.wikimediaEvents: Soft-depend on MetricsPlatform [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:20:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10872457 (10Jhancock.wm) 05Open→03Resolved [20:21:05] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10872460 (10Jhancock.wm) @Marostegui this is completed [20:23:50] (03CR) 10BryanDavis: "recheck" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:32:24] ok are we ready to roll on 1152349 + 1152346 deployment together? or anything else still need poking first :D [20:33:03] I don't think so. We can ship 1152348 at the same time too or I can abandon it as not really needed for backporting [20:33:23] take your pick :) you wanna have the honors? [20:33:35] that was the "extra" one I grabbed because e_TOOMANYTABS. I'll abandon it. [20:33:41] cool [20:33:54] (03Abandoned) 10BryanDavis: qunit: More readable assert.step() call in mediawiki.base/track test [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152348 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:34:42] do you want to push buttons or should I bvibber? [20:36:08] heh I see us deferring to each other now. 
I'll do the needful [20:36:11] lol [20:36:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:36:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) (owner: 10Bvibber) [20:36:46] wee [20:37:09] Thanks for catching that bvibber. It was quite a lot to untangle but Derick and I still managed to miss a case in get(). [20:37:21] folks with the permissions set up can watch at https://spiderpig.wikimedia.org/ too if they want [20:37:24] :D [20:37:38] Thinking that maybe storing `null` will be easier going forward rather than false > '' > false [20:37:48] yeah it's kinda spaghetti in there, i'm trying to leave it cleaner than i found it when i go in and add things :D [20:37:49] but will need to look at callers etc [20:38:11] yeah, we were doing the same thing, driven by sus use of code using WANObjectCache get() and set() directly instead of getWithSet. [20:38:12] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Soft-depend on MetricsPlatform [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152349 (https://phabricator.wikimedia.org/T395684) (owner: 10BryanDavis) [20:38:14] (03Merged) 10jenkins-bot: Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152346 (https://phabricator.wikimedia.org/T395368) (owner: 10Bvibber) [20:38:26] it's using memcSet() with instance side-effects and load() likewise as well. 
[20:38:37] !log bd808@deploy1003 Started scap sync-world: Backport for [[gerrit:1152349|ext.wikimediaEvents: Soft-depend on MetricsPlatform (T395684 T395494)]], [[gerrit:1152346|Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" (T395368)]] [20:38:44] T395684: CI error: mediawiki.base/track trackError: unexpected "{\"exception\":{},\"source\":\"resolve\"}" - https://phabricator.wikimedia.org/T395684 [20:38:45] T395494: MetricsPlatform dependency is causing CI to fail on older (yet supported) MediaWiki release branches (REL1_39, REL1_42) - https://phabricator.wikimedia.org/T395494 [20:38:45] T395368: PHP Warning: Attempt to read property "fields" on null - https://phabricator.wikimedia.org/T395368 [20:38:48] yeah the callback is cleaner [20:39:00] i think the initial wancache conversion was my fault so, hey! full circle ;) [20:40:17] bd808: that fast CI pass, I'm guessing that's thanks to Dan's CI result cache? [20:40:31] yeah, I think so [20:40:37] !log bd808@deploy1003 bd808, bvibber: Backport for [[gerrit:1152349|ext.wikimediaEvents: Soft-depend on MetricsPlatform (T395684 T395494)]], [[gerrit:1152346|Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" (T395368)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:40:41] that's a real game changer around backports [20:40:49] 15min *poof* [20:41:16] https://en.wikipedia.org/wiki/California%27s_3rd_congressional_district loads on the test servers again \o/ [20:41:36] !log bd808@deploy1003 bd808, bvibber: Continuing with sync [20:41:37] \o/ [20:41:51] oh i am so glad that actually worked too lol [20:41:58] heh [20:42:28] don't doubt your long honed instincts bvibber :) [20:42:49] :) [20:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:48:37] !log bd808@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152349|ext.wikimediaEvents: Soft-depend on MetricsPlatform (T395684 T395494)]], [[gerrit:1152346|Revert "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" (T395368)]] (duration: 09m 59s) [20:48:44] T395684: CI error: mediawiki.base/track trackError: unexpected "{\"exception\":{},\"source\":\"resolve\"}" - https://phabricator.wikimedia.org/T395684 [20:48:44] T395494: MetricsPlatform dependency is causing CI to fail on older (yet supported) MediaWiki release branches (REL1_39, REL1_42) - https://phabricator.wikimedia.org/T395494 [20:48:45] T395368: PHP Warning: Attempt to read property "fields" on null - https://phabricator.wikimedia.org/T395368 [20:49:08] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch[1055-1059].eqiad.wmnet [20:49:46] https://en.wikipedia.org/wiki/California%27s_3rd_congressional_district working on all the servers now. Thanks for the debugging and patch babysitting bvibber and Krinkle. [20:50:02] woohoo! 
good work, team [20:50:13] <3 [20:52:05] bvibber: Does https://gerrit.wikimedia.org/r/c/mediawiki/extensions/JsonConfig/+/1152128 still apply? Or did that one have a different root cause? (I see another task tagged on that one, so not sure) [20:54:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:55:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:57:02] Krinkle: same root cause though it could conceivably come up with previously-unvalidated old data [20:57:25] but it'll no longer be a priority as there's no longer a flood of nulls into the validation and lua conversion routines :D [20:59:23] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 102508344 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:00:23] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5781664 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:04:35] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:05:31] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10872556 (10Jhancock.wm) @Andrew hey sorry to get back to this so late. This one does not have a raid controller. just an HBA. cannot set a hardware raid. 
[21:05:39] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10872557 (10Jhancock.wm) [21:08:03] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1055-1059].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [21:08:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:08:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1055-1059].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [21:08:29] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:08:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cirrussearch[1055-1059].eqiad.wmnet [21:11:18] jhancock@cumin2002 provision (PID 1277813) is awaiting input [21:37:15] jhancock@cumin2002 provision (PID 1277813) is awaiting input [21:38:28] (03CR) 10BryanDavis: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [21:50:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:51:01] (03CR) 10A smart kitten: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [21:51:48] (03CR) 10A smart kitten: "(ignore; accidental click)" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 
10Majavah) [22:20:30] (03CR) 10BryanDavis: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1143602/4083/ is clean for prod hosts. See https://phabricator.wikimedia.org/T393404#10872680 f" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [22:40:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [22:40:30] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10872789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm [22:41:26] (03PS3) 10Eevans: cassandra: reuse preseed for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) [22:49:28] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10872793 (10Jhancock.wm) @RobH i'm having trouble figuring out what raid to use for this one. The preseed might need an update. could you take a look at it for me when you are free? [23:22:45] (03PS8) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [23:38:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152377 [23:38:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152377 (owner: 10TrainBranchBot) [23:49:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152377 (owner: 10TrainBranchBot)