[00:00:03] (CR) Dzahn: [V:+1] "this let's us monitor both sites with the standard blackbox checks" [puppet] - https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:02:07] (CR) Dzahn: [V:+1] "This has no effect on hosts that are not the CI manager (active CI server) so far." [puppet] - https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:02:32] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1248127/8260/" [puppet] - https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:03:27] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7012.magru.wmnet with OS trixie
[00:05:39] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11700394 (Dzahn) Adding checklist from https://wikitech.wikimedia.org/wiki/SRE...
[00:06:25] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11700395 (Dzahn)
[00:07:04] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11700396 (Dzahn)
[00:07:21] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11700397 (Dzahn)
[00:07:50] (CR) Scott French: [C:+1] mathoid: Upgrade to envoy-future:1.35.9 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1250728 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[00:08:26] (CR) Scott French: [C:+1] {api,rest}-gateway: Update staging to Envoy 1.35.9 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1250731 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[00:09:29] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11700398 (Dzahn)
[00:15:18] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11700401 (Dzahn)
[00:19:58] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:20:13] (CR) Dzahn: "I think it just needs the manager approval but all else is done. I copied the access request check list to the ticket and marked the boxes" [puppet] - https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: Lerickson)
[00:24:57] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7012.magru.wmnet with reason: host reimage
[00:26:52] (CR) Ottomata: stream: mediawiki.page_edit_type_simple (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: AKhatun)
[00:27:32] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: sync
[00:29:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7012.magru.wmnet with reason: host reimage
[00:31:47] (PS1) Dzahn: jenkins: add ci::httpd profile to role [puppet] - https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521)
[00:32:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:10] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync
[00:33:34] (CR) Dzahn: [V:-1] "Evaluation Error: Error while evaluating a Function Call, profile not supported by trixie (file: /srv/jenkins/puppet-compiler/8261/change/" [puppet] - https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:34:27] (CR) Dzahn: [V:+1 C:-1] "needs I6f67e4f00fb8a2c1" [puppet] - https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:35:24] (CR) Dzahn: [V:-1] "here is where it's starting to become trickier:" [puppet] - https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:36:27] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync
[00:36:49] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync
[00:37:01] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync
[00:37:22] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync
[00:38:05] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync
[00:38:34] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync
[00:39:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250754
[00:39:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250754 (owner: TrainBranchBot)
[00:40:03] (PS1) Dzahn: profile::ci: add support for trixie / PHP8.4 [puppet] - https://gerrit.wikimedia.org/r/1250755 (https://phabricator.wikimedia.org/T418521)
[00:41:38] (CR) Dzahn: "@Antoine this assumes we need httpd with PHP on the jenkins trixie host. alternative would be to refactor or only include httpd itself wit" [puppet] - https://gerrit.wikimedia.org/r/1250755 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:42:22] (CR) Dzahn: [V:-1] "needs I1224d36f60490c9423b" [puppet] - https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:42:26] (CR) Dzahn: [V:-1 C:-1] jenkins: add ci::httpd profile to role [puppet] - https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:49:16] (PS1) Dzahn: add zuul-legacy to point at old zuul [dns] - https://gerrit.wikimedia.org/r/1250756 (https://phabricator.wikimedia.org/T418521)
[00:50:24] (PS2) Dzahn: add zuul-legacy to point at old zuul [dns] - https://gerrit.wikimedia.org/r/1250756 (https://phabricator.wikimedia.org/T418521)
[00:50:59] (CR) RLazarus: [C:+2] {api,rest}-gateway: Update staging to Envoy 1.35.9 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1250731 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[00:51:29] (CR) Dzahn: "The new name for the old zuul which we would point to the existing contint bullseye. (and then add the apache virtual host accordingly). A" [dns] - https://gerrit.wikimedia.org/r/1250756 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:52:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250754 (owner: TrainBranchBot)
[00:53:10] (Merged) jenkins-bot: {api,rest}-gateway: Update staging to Envoy 1.35.9 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1250731 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[00:58:31] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[00:58:54] (CR) Lerickson: "Thank you! trueg and I have the same manager, namely David Santamaria. I pinged him the ticket." [puppet] - https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: Lerickson)
[00:59:47] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002"
[00:59:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7012.magru.wmnet with OS trixie
[01:00:10] (PS1) Dzahn: trafficserver/contint: add zuul-legacy site [puppet] - https://gerrit.wikimedia.org/r/1250757 (https://phabricator.wikimedia.org/T418521)
[01:00:41] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2004.codfw.wmnet with OS trixie
[01:00:48] (CR) Dzahn: "sounds good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: Lerickson)
[01:02:00] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:03:09] !log reprepro include php8.3_8.3.30-1+icu72+wmf11u1 into component/php83-icu72 - T419058
[01:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:13] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058
[01:05:15] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[01:05:33] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[01:08:28] !log reprepro include php-defaults_94+icu72+wmf11u1 into component/php83-icu72 - T419058
[01:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:32] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058
[01:09:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1250758
[01:09:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1250758 (owner: TrainBranchBot)
[01:13:49] !log reprepro include dh-php_5.5+icu72+wmf11u1 into component/php83-icu72 - T419058
[01:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:53] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058
[01:15:27] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2004.codfw.wmnet with reason: host reimage
[01:18:11] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[01:18:26] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2004.codfw.wmnet with reason: host reimage
[01:20:58] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[01:21:13] ops-esams, ops-magru, SRE, DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11700463 (RobH) Open→Resolved > Hi Rob, > > This is to confirm that we received the files, and this have been share...
[01:24:31] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7012.*
[01:25:36] ops-magru, DC-Ops, Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700467 (BCornwall) Open→Resolved Thank you for all your work, rob. I was able to reimage and all seems well now. I'll re-open this is anything changes.
[01:27:56] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1250758 (owner: TrainBranchBot)
[01:36:08] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2004.codfw.wmnet with OS trixie
[01:37:07] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2005.codfw.wmnet with OS trixie
[01:42:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:47:01] !log reprepro include php-apcu_5.1.24-1+icu72+wmf11u1 into component/php83-icu72 - T419058
[01:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:04] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058
[01:49:17] !log reprepro include php-msgpack_3.0.0-1+icu72+wmf11u1 into component/php83-icu72 - T419058
[01:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:07] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: host reimage
[01:59:45] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: host reimage
[02:00:54] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:03:00] !log reprepro include php-igbinary_3.2.16-4+icu72+wmf11u1 into component/php83-icu72 - T419058
[02:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:04] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058
[02:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:09:08] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 14s)
[02:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:16:30] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2005.codfw.wmnet with OS trixie
[03:52:12] (PS5) AKhatun: stream: mediawiki.page_edit_type_simple [deployment-charts] - https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225)
[03:53:35] (CR) AKhatun: stream: mediawiki.page_edit_type_simple (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: AKhatun)
[04:19:58] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:02:00] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:11:10] (CR) KartikMistry: [C:+2] machinetranslation: Optimize model loading and memory footprints [deployment-charts] - https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) (owner: KartikMistry)
[05:13:22] (Merged) jenkins-bot: machinetranslation: Optimize model loading and memory footprints [deployment-charts] - https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) (owner: KartikMistry)
[05:16:34] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:19:16] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[05:24:44] !log staging: machinetranslation: Optimize model loading and memory footprints (T411058)
[05:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:47] T411058: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058
[05:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T0600)
[06:00:05] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T0600).
[06:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:08:45] (CR) DSantamaria: "Approved! cc @dzahn@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: Lerickson)
[06:11:22] (CR) DSantamaria: [C:+1] Add lerickson and trueg to analytics-wikidata-users [puppet] - https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: Lerickson)
[06:12:45] SRE, SRE-Access-Requests, OKR-Work, Patch-For-Review, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11700716 (DSantamaria) A...
[06:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:12:53] (PS1) Ayounsi: cr-cloud-hosts ACLs, remove puppetmaster_group [homer/public] - https://gerrit.wikimedia.org/r/1250941
[07:29:41] (CR) Majavah: [C:+1] cr-cloud-hosts ACLs, remove puppetmaster_group [homer/public] - https://gerrit.wikimedia.org/r/1250941 (owner: Ayounsi)
[07:35:03] (PS1) Slyngshede: P:idp switch default OIDC profile format to FLAT [puppet] - https://gerrit.wikimedia.org/r/1250944
[07:37:09] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1250944 (owner: Slyngshede)
[07:39:20] (CR) Muehlenhoff: [C:+1] "Looks good" [homer/public] - https://gerrit.wikimedia.org/r/1250941 (owner: Ayounsi)
[07:39:40] (CR) Ayounsi: [C:+2] cr-cloud-hosts ACLs, remove puppetmaster_group [homer/public] - https://gerrit.wikimedia.org/r/1250941 (owner: Ayounsi)
[07:41:01] (Merged) jenkins-bot: cr-cloud-hosts ACLs, remove puppetmaster_group [homer/public] - https://gerrit.wikimedia.org/r/1250941 (owner: Ayounsi)
[07:44:35] (PS2) Slyngshede: P:idp switch default OIDC profile format to FLAT [puppet] - https://gerrit.wikimedia.org/r/1250944
[07:45:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncredir4003.ulsfo.wmnet
[07:46:02] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1250944 (owner: Slyngshede)
[07:51:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4003.ulsfo.wmnet
[07:53:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncredir4004.ulsfo.wmnet
[07:57:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4004.ulsfo.wmnet
[08:16:10] (PS1) Muehlenhoff: Add tcp-proxy4003/4004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1251002 (https://phabricator.wikimedia.org/T418993)
[08:16:24] (CR) Jelto: [V:+1 C:+2] gitlab: de-duplicate active_host checks [puppet] - https://gerrit.wikimedia.org/r/1250571 (owner: Jelto)
[08:19:58] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:20:47] (CR) Jelto: [V:+1 C:+2] "also noop after a `run-puppet-agent` on the hosts" [puppet] - https://gerrit.wikimedia.org/r/1250571 (owner: Jelto)
[08:21:15] FIRING: ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:21:46] (CR) Arnaudb: [C:+2] mailman: move mailman-web behind CDN [dns] - https://gerrit.wikimedia.org/r/1249310 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[08:21:55] !log arnaudb@dns1004 START - running authdns-update
[08:23:21] !log arnaudb@dns1004 END - running authdns-update
[08:23:27] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11701018 (ABran-WMF)
[08:25:58] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T419712
[08:26:15] RESOLVED: ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:26:44] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11701033 (ABran-WMF) mailman-web is now behind CDN: ` $ dig -x $(dig A lists.wikimedia.org +short) +short text-lb.drmrs.wikimedia.o...
[08:28:36] (CR) Jcrespo: "does hieradata/codfw/profile/swift/proxy.yaml need updating too, according to docs?" [puppet] - https://gerrit.wikimedia.org/r/1250609 (https://phabricator.wikimedia.org/T416243) (owner: MVernon)
[08:29:11] (PS1) Esanders: Deploy EditCheck suggestion mode at all Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1251005 (https://phabricator.wikimedia.org/T415320)
[08:29:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251005 (https://phabricator.wikimedia.org/T415320) (owner: Esanders)
[08:32:23] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts netflow4002.ulsfo.wmnet
[08:32:46] (CR) Jcrespo: [C:+1] swift: add 4 new codfw frontends [puppet] - https://gerrit.wikimedia.org/r/1250609 (https://phabricator.wikimedia.org/T416243) (owner: MVernon)
[08:33:57] (CR) Jcrespo: [C:+1] "I git-grepped ms-fe2020 to confirm." [puppet] - https://gerrit.wikimedia.org/r/1250609 (https://phabricator.wikimedia.org/T416243) (owner: MVernon)
[08:35:01] (CR) Ayounsi: [C:+1] Add tcp-proxy4003/4004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1251002 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:37:01] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:35] (Abandoned) DCausse: team-search: use deriv instead of rate for flink metrics [alerts] - https://gerrit.wikimedia.org/r/1058176 (owner: DCausse)
[08:37:53] (CR) MVernon: [C:+2] swift: add 4 new codfw frontends [puppet] - https://gerrit.wikimedia.org/r/1250609 (https://phabricator.wikimedia.org/T416243) (owner: MVernon)
[08:38:39] FIRING: [2x] JobUnavailable: Reduced availability for job fastnetmon in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:39:35] (PS1) Slyngshede: IDP: failover to idp2005 [dns] - https://gerrit.wikimedia.org/r/1251007
[08:42:37] jmm@cumin2002 decommission (PID 1759273) is awaiting input
[08:43:39] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:02] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1251007 (owner: Slyngshede)
[08:49:57] (CR) Slyngshede: [C:+2] IDP: failover to idp2005 [dns] - https://gerrit.wikimedia.org/r/1251007 (owner: Slyngshede)
[08:50:06] !log slyngshede@dns1004 START - running authdns-update
[08:51:28] !log slyngshede@dns1004 END - running authdns-update
[08:53:39] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:45] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.692 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:55:45] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:55:45] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:55:47] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.046 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:56:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:59:53] jmm@cumin2002 decommission (PID 1759273) is awaiting input
[09:01:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:01:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:01:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow4002.ulsfo.wmnet
[09:01:17] (CR) Muehlenhoff: [C:+2] Add tcp-proxy4003/4004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1251002 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[09:01:24] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11701178 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `netflow4002.ulsfo.wmnet` - netflow4002.ulsfo.wmnet (**PASS**...
[09:02:00] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:03:38] (PS1) Ayounsi: routed ganeti durum, don't peer with the core routers [puppet] - https://gerrit.wikimedia.org/r/1251010 (https://phabricator.wikimedia.org/T418993)
[09:03:40] (PS1) Arnaudb: mailman: update the web frontend firewall rules [puppet] - https://gerrit.wikimedia.org/r/1251009 (https://phabricator.wikimedia.org/T286066)
[09:03:40] (CR) Arnaudb: "pcc output visible here:" [puppet] - https://gerrit.wikimedia.org/r/1251009 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[09:04:06] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1251010 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:04:20] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11701182 (ABran-WMF)
[09:05:36] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817 (MatthewVernon) NEW
[09:05:49] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11701195 (MatthewVernon) p:Triage→High
[09:06:39] (PS1) David Caro: wmcs,neutron: fail gracefully if neutron admin down does note exist [alerts] - https://gerrit.wikimedia.org/r/1251011
[09:07:54] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701203 (ABran-WMF) with {T286066} done it should be better: >>! In T286066#11701033, @ABran-WMF wrote: > mailman-...
[09:08:21] (CR) Muehlenhoff: "Looks good, but could you please also add the same for doh4003, doh4004, dns4005, dns4006, hcaptcha-proxy4003, hcaptcha-proxy4004?" [puppet] - https://gerrit.wikimedia.org/r/1251010 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:08:22] (CR) CI reject: [V:-1] wmcs,neutron: fail gracefully if neutron admin down does note exist [alerts] - https://gerrit.wikimedia.org/r/1251011 (owner: David Caro)
[09:12:06] (PS2) David Caro: wmcs,neutron: fail gracefully if neutron admin down does note exist [alerts] - https://gerrit.wikimedia.org/r/1251011
[09:12:25] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701217 (Reedy) I can’t login either {F72815272}
[09:15:01] (CR) Muehlenhoff: [C:+1] "Looks good (we'll take care of the rest once confirmed)" [puppet] - https://gerrit.wikimedia.org/r/1251010 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:15:12] (CR) Ayounsi: [C:+2] routed ganeti durum, don't peer with the core routers [puppet] - https://gerrit.wikimedia.org/r/1251010 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:18:30] RECOVERY - Bird Internet Routing Daemon on durum4003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:18:52] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701236 (ABran-WMF) >>! In T353891#11701217, @Reedy wrote: > I can’t login either > > {F72815272} Interesting, I'...
[09:18:53] (CR) Filippo Giunchedi: [C:+1] wmcs,neutron: fail gracefully if neutron admin down does note exist [alerts] - https://gerrit.wikimedia.org/r/1251011 (owner: David Caro)
[09:20:18] (CR) David Caro: [C:+2] wmcs,neutron: fail gracefully if neutron admin down does note exist [alerts] - https://gerrit.wikimedia.org/r/1251011 (owner: David Caro)
[09:20:49] PROBLEM - Host ms-fe2024 is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:09] PROBLEM - Host ms-fe2023 is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:17] PROBLEM - Host ms-fe2022 is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:17] PROBLEM - Host ms-fe2021 is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:48] (Merged) jenkins-bot: wmcs,neutron: fail gracefully if neutron admin down does note exist [alerts] - https://gerrit.wikimedia.org/r/1251011 (owner: David Caro)
[09:21:55] RECOVERY - Host ms-fe2022 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[09:21:55] RECOVERY - Host ms-fe2021 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[09:21:55] RECOVERY - Host ms-fe2023 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[09:22:39] RECOVERY - Host ms-fe2024 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms
[09:22:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum4003.ulsfo.wmnet
[09:23:16] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[09:23:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy4003.ulsfo.wmnet [09:23:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:26:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:26:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum4003.ulsfo.wmnet [09:27:28] (03PS1) 10David Caro: site: check if role is defined instead of false [puppet] - 10https://gerrit.wikimedia.org/r/1251012 [09:28:02] !log roll-restart codfw ms frontends prior to pooling new ones T416243 [09:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:05] T416243: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243 [09:28:45] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[2009-2020].codfw.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:28:55] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:29:04] (03PS2) 10David Caro: site: check if role is defined instead of false [puppet] - 10https://gerrit.wikimedia.org/r/1251012 [09:29:07] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [09:29:55] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 115577 bytes in 0.531 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:30:05] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4003.ulsfo.wmnet - jmm@cumin2002" [09:30:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4003.ulsfo.wmnet - jmm@cumin2002" [09:30:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:30:11] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy4003.ulsfo.wmnet on all recursors [09:30:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy4003.ulsfo.wmnet on all recursors [09:30:45] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4003.ulsfo.wmnet - jmm@cumin2002" [09:30:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4003.ulsfo.wmnet - jmm@cumin2002" [09:32:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy4003.ulsfo.wmnet with OS trixie [09:32:38] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T419712 [09:32:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum4004.ulsfo.wmnet [09:33:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:06] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701257 (10Reedy) Yup, same behaviour. 
Not that I’ve logged in on my mobile in a while anyway… nor was it a tab I alr... [09:35:00] (03PS3) 10David Caro: site: check if role is defined instead of false [puppet] - 10https://gerrit.wikimedia.org/r/1251012 [09:35:00] (03CR) 10David Caro: site: check if role is defined instead of false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [09:35:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe[2009-2020].codfw.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:35:38] (03CR) 10David Caro: "That was a draft that got submitted by mistake, ignore XD" [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [09:35:40] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [09:36:23] (03CR) 10Dpogorzelski: [C:03+1] profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [09:38:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:31] !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [09:38:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [09:39:03] !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [09:39:03] !log 
btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [09:39:15] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Post reimage - btullis@cumin1003" [09:39:16] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11701269 (10ayounsi) [09:39:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Post reimage - btullis@cumin1003" [09:39:22] !log mvernon@cumin2002 conftool action : set/weight=40; selector: name=ms-fe2021.codfw.wmnet [09:39:34] !log mvernon@cumin2002 conftool action : set/weight=40; selector: name=ms-fe2022.codfw.wmnet [09:39:45] !log mvernon@cumin2002 conftool action : set/weight=40; selector: name=ms-fe2023.codfw.wmnet [09:39:52] !log mvernon@cumin2002 conftool action : set/weight=40; selector: name=ms-fe2024.codfw.wmnet [09:40:03] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=ms-fe2021.codfw.wmnet [09:40:11] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=ms-fe2022.codfw.wmnet [09:40:20] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=ms-fe2023.codfw.wmnet [09:40:27] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=ms-fe2024.codfw.wmnet [09:41:00] (03PS1) 10Btullis: Revert "Set dse-k8s-worker101[0-1] into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1251019 [09:41:41] (03CR) 10David Caro: [V:03+1] "The PCC failure is as expected, as that node does not have a role." 
[puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [09:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:24] (03PS2) 10Arnaudb: mailman: increase envoy timeout [puppet] - 10https://gerrit.wikimedia.org/r/1251016 (https://phabricator.wikimedia.org/T286066) [09:45:24] (03CR) 10Arnaudb: [C:03+2] "from: https://puppet-compiler.wmflabs.org/output/1251016/6037/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1251016 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [09:47:34] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: enable Mar2026 policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250676 (https://phabricator.wikimedia.org/T417778) (owner: 10Daniel Kinzler) [09:48:33] (03CR) 10Btullis: [C:03+2] Revert "Set dse-k8s-worker101[0-1] into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1251019 (owner: 10Btullis) [09:48:56] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/SERVICE_NAME: apply [09:48:58] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/SERVICE_NAME: apply [09:49:14] (03PS1) 10Muehlenhoff: doh/hcaptcha-proxy on routed Ganeti/ulsfo: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1251021 (https://phabricator.wikimedia.org/T418993) [09:50:47] (03PS1) 10Joal: Update dse-k8s airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251022 (https://phabricator.wikimedia.org/T419540) [09:52:15] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701314 (10ABran-WMF) 
https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ is available... [09:52:31] (03PS1) 10Kgraessle: PersonalDashboard edit count configurations should have an upper bound and limit personal tools menu access too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251023 (https://phabricator.wikimedia.org/T418365) [09:53:10] (03PS2) 10Muehlenhoff: doh/hcaptcha-proxy on routed Ganeti/ulsfo: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1251021 (https://phabricator.wikimedia.org/T418993) [09:53:54] (03PS8) 10Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [09:54:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820 (10BTullis) 03NEW [09:55:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11701331 (10BTullis) [09:55:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11701333 (10BTullis) p:05Triage→03Medium [09:55:55] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11701334 (10BTullis) [09:56:11] (03CR) 10Ayounsi: [C:03+1] doh/hcaptcha-proxy on routed Ganeti/ulsfo: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1251021 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:56:14] (03CR) 10Aqu: [C:03+1] "Good. It matches the image produced by the gitlab ci." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251022 (https://phabricator.wikimedia.org/T419540) (owner: 10Joal) [09:56:45] 06SRE: Add a not active server warning to mwlog servers - https://phabricator.wikimedia.org/T419821 (10Urbanecm_WMF) 03NEW [09:57:30] (03PS1) 10Majavah: wmcs: neutron: Disable Pint series check for NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1251024 [09:57:45] (03PS3) 10Muehlenhoff: doh/hcaptcha-proxy on routed Ganeti/ulsfo: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1251021 (https://phabricator.wikimedia.org/T418993) [09:58:15] (03PS2) 10Kgraessle: PersonalDashboard edit count configurations should have an upper bound and limit personal tools menu access too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251023 (https://phabricator.wikimedia.org/T418365) [09:58:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy4003.ulsfo.wmnet with reason: host reimage [09:58:33] (03CR) 10Ayounsi: [C:03+1] doh/hcaptcha-proxy on routed Ganeti/ulsfo: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1251021 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:58:39] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:58:43] 06SRE, 06SRE Observability: Add a not active server warning to mwlog servers - https://phabricator.wikimedia.org/T419821#11701350 (10taavi) [09:59:28] (03CR) 10CI reject: [V:04-1] PersonalDashboard edit count configurations should have an upper bound and limit personal tools menu access too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251023 (https://phabricator.wikimedia.org/T418365) (owner: 10Kgraessle) [09:59:58] FIRING: [5x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1000) [10:00:14] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1010.eqiad.wmnet [10:01:03] (03PS3) 10Kgraessle: PersonalDashboard edit count configurations should have an upper bound and limit personal tools menu access too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251023 (https://phabricator.wikimedia.org/T418365) [10:03:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy4003.ulsfo.wmnet with reason: host reimage [10:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:05:41] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11701359 (10tappof) [10:06:33] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: enable Mar2026 policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250676 (https://phabricator.wikimedia.org/T417778) (owner: 10Daniel Kinzler) [10:06:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1010.eqiad.wmnet [10:06:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1011.eqiad.wmnet [10:08:39] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:09:00] (03Merged) 10jenkins-bot: rest-gateway: enable Mar2026 policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250676 (https://phabricator.wikimedia.org/T417778) (owner: 10Daniel Kinzler) [10:09:31] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:10:22] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11701364 (10Ladsgroup) I use web posting from time to time (when I'm subscribed... [10:10:23] (03CR) 10Muehlenhoff: [C:03+2] doh/hcaptcha-proxy on routed Ganeti/ulsfo: Don't peer with the core routers [puppet] - 10https://gerrit.wikimedia.org/r/1251021 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:10:30] (03CR) 10David Caro: [C:03+1] wmcs: neutron: Disable Pint series check for NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1251024 (owner: 10Majavah) [10:10:41] (03CR) 10Majavah: [C:03+2] wmcs: neutron: Disable Pint series check for NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1251024 (owner: 10Majavah) [10:10:52] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:11:15] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:11:18] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:11:23] (03PS5) 10Federico Ceratto: Switch linting to ruff [cookbooks] - 10https://gerrit.wikimedia.org/r/1240635 [10:12:12] (03Merged) 10jenkins-bot: wmcs: neutron: Disable Pint series check for NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1251024 (owner: 10Majavah) [10:12:23] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:12:47] !log 
btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1011.eqiad.wmnet [10:12:57] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:13:01] (03PS1) 10Muehlenhoff: Remove ferm rule for netflow4002 [puppet] - 10https://gerrit.wikimedia.org/r/1251028 (https://phabricator.wikimedia.org/T418993) [10:13:39] FIRING: [5x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:13:56] (03CR) 10CI reject: [V:04-1] Switch linting to ruff [cookbooks] - 10https://gerrit.wikimedia.org/r/1240635 (owner: 10Federico Ceratto) [10:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:16:01] (03PS1) 10Arnaudb: mailman: keep SECURE_PROXY_SSL_HEADER on X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/1251026 (https://phabricator.wikimedia.org/T353891) [10:16:01] (03CR) 10Arnaudb: [C:03+2] "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1251026/6038/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1251026 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [10:16:25] jouncebot nowandnext [10:16:25] For the next 0 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1000) [10:16:25] In 1 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1200) [10:17:32] (03PS1) 10Slyngshede: hiera: use short for meassures [puppet] - 10https://gerrit.wikimedia.org/r/1251029 [10:18:05] (03CR) 10CI reject: [V:04-1] hiera: use 
short for meassures [puppet] - 10https://gerrit.wikimedia.org/r/1251029 (owner: 10Slyngshede) [10:18:49] (03PS6) 10Federico Ceratto: Switch linting to ruff [cookbooks] - 10https://gerrit.wikimedia.org/r/1240635 [10:20:25] (03CR) 10Ayounsi: [C:03+1] Remove ferm rule for netflow4002 [puppet] - 10https://gerrit.wikimedia.org/r/1251028 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:20:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy4003.ulsfo.wmnet with OS trixie [10:20:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy4003.ulsfo.wmnet [10:21:43] (03CR) 10Muehlenhoff: [C:03+2] Remove ferm rule for netflow4002 [puppet] - 10https://gerrit.wikimedia.org/r/1251028 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:21:56] (03CR) 10CI reject: [V:04-1] Switch linting to ruff [cookbooks] - 10https://gerrit.wikimedia.org/r/1240635 (owner: 10Federico Ceratto) [10:22:58] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1013 [10:22:58] (03PS1) 10Phuedx: ext.testKitchen: Depend on mediawiki.user module [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251031 [10:23:11] (03PS1) 10Phuedx: ext.testKitchen: Depend on mediawiki.user module [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 [10:23:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1013 [10:24:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251031 (owner: 10Phuedx) [10:24:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251031 (owner: 10Phuedx) [10:25:00] jmm@cumin2002 makevm (PID 1782660) is awaiting input [10:25:33] PROBLEM - Host dse-k8s-worker1013 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:49] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:26:16] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:26:44] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:26:45] (03PS1) 10Btullis: Temporarily set dse-k8s-worker1013 into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1251033 (https://phabricator.wikimedia.org/T414787) [10:27:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy4004.ulsfo.wmnet [10:27:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:28:21] (03CR) 10Btullis: [C:03+2] Temporarily set dse-k8s-worker1013 into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1251033 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [10:28:51] (03CR) 10CI reject: [V:04-1] ext.testKitchen: Depend on mediawiki.user module [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 (owner: 10Phuedx) [10:29:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4003.ulsfo.wmnet [10:29:58] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1011.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:30:01] (03CR) 10Phuedx: "Recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 (owner: 10Phuedx) [10:30:16] !log repooling ncredir4003 & ncredir4004 [10:30:18] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:35] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701412 (10ABran-WMF) >>! In T353891#11701257, @Reedy wrote: > Yup, same behaviour. Not that I’ve logged in on my mobi... [10:30:45] !log vgutierrez@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir4003.ulsfo.wmnet [10:31:07] !log vgutierrez@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir4004.ulsfo.wmnet [10:31:32] !log vgutierrez@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir4002.ulsfo.wmnet [10:31:36] !log vgutierrez@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir4001.ulsfo.wmnet [10:32:01] (03PS1) 10Matthias Mullie: Update CSS selector for Mobile TOC button [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251034 (https://phabricator.wikimedia.org/T419587) [10:32:12] (03PS1) 10Matthias Mullie: Update CSS selector for Mobile TOC button [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251035 (https://phabricator.wikimedia.org/T419587) [10:32:49] jmm@cumin2002 makevm (PID 1782660) is awaiting input [10:33:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4003.ulsfo.wmnet [10:35:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:36:03] (03PS1) 10Matthias Mullie: Remove queueing logic [extensions/ReaderExperiments] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251036 (https://phabricator.wikimedia.org/T419587) [10:36:13] (03CR) 10CI reject: [V:04-1] Update CSS selector for Mobile TOC button 
[extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251034 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [10:36:14] (03PS1) 10Matthias Mullie: Remove queueing logic [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251037 (https://phabricator.wikimedia.org/T419587) [10:36:14] (03CR) 10CI reject: [V:04-1] Update CSS selector for Mobile TOC button [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251035 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [10:37:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251036 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [10:37:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251037 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [10:37:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251034 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [10:38:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251035 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) 
[10:39:09] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701433 (10Jelto) I can login and logout normally on firefox, I can't reproduce the CSRF error [10:41:56] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [10:42:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-23-ulsfo [10:46:10] (03PS1) 10Trueg: dse-k8s-eqiad: wdqs-queryhammer namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251040 (https://phabricator.wikimedia.org/T417415) [10:49:36] (03PS7) 10Federico Ceratto: Switch linting to ruff [cookbooks] - 10https://gerrit.wikimedia.org/r/1240635 [10:52:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 (owner: 10Phuedx) [10:54:04] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11701471 (10ayounsi) I think the factory reset helped. I then temporarily copied the TLS config from asw1-22, and ran the TLS cookbook and we're all good. So now... 
[10:56:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4004.ulsfo.wmnet - jmm@cumin2002" [10:56:31] (03PS1) 10Arnaudb: arnaudb: update PATH variable in bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1251041 [10:59:25] jmm@cumin2002 makevm (PID 1782660) is awaiting input [11:00:26] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [11:03:04] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:04:30] (03PS1) 10JMeybohm: apparmor/wikifunctions: Update profiles to containerd 1.7 defaults [puppet] - 10https://gerrit.wikimedia.org/r/1251046 (https://phabricator.wikimedia.org/T419781) [11:05:00] (03CR) 10Jelto: [C:03+1] "change and PCC diff looks reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/1251009 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [11:06:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4004.ulsfo.wmnet - jmm@cumin2002" [11:06:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:03] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy4004.ulsfo.wmnet on all recursors [11:06:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy4004.ulsfo.wmnet on all recursors [11:06:36] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4004.ulsfo.wmnet - jmm@cumin2002" [11:06:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4004.ulsfo.wmnet - jmm@cumin2002" [11:07:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host 
tcp-proxy4004.ulsfo.wmnet with OS trixie [11:08:07] (03CR) 10Phuedx: "Recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 (owner: 10Phuedx) [11:11:08] (03PS1) 10Phuedx: Add title to the request context in FlaggedRevsCacheTest [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251048 (https://phabricator.wikimedia.org/T419539) [11:11:32] (03PS2) 10Phuedx: ext.testKitchen: Depend on mediawiki.user module [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 [11:11:54] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11701532 (10Reedy) Mobile FF does the same for me… [11:12:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1250944 (owner: 10Slyngshede) [11:12:18] (03PS21) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [11:14:19] (03PS1) 10Blake: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) [11:14:19] (03CR) 10Blake: "Hey Scott! I had one question about when this ought to be deployed - should that happen on ~the Monday prior to the switchover? 
Or some ot" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [11:19:38] !log disabled puppet on all wikikube worker nodes to rollout/test new apparmor profiles in staging - T419781 [11:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:42] T419781: Lots of Wikifunctions k8s pods in staging stuck in "Terminating", some for 14 days+ - https://phabricator.wikimedia.org/T419781 [11:19:53] (03CR) 10JMeybohm: [C:03+2] apparmor/wikifunctions: Update profiles to containerd 1.7 defaults [puppet] - 10https://gerrit.wikimedia.org/r/1251046 (https://phabricator.wikimedia.org/T419781) (owner: 10JMeybohm) [11:28:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy4004.ulsfo.wmnet with reason: host reimage [11:34:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy4004.ulsfo.wmnet with reason: host reimage [11:34:38] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new, 07Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11701583 (10elukey) @ecarg I checked a little bit more the logs and the 504s are registered in the... [11:45:30] (03PS10) 10Effie Mouzeli: Update chart metadata for various charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250597 (https://phabricator.wikimedia.org/T412693) [11:49:24] (03CR) 10Sergio Gimeno: [C:03+1] cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [11:49:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [11:50:04] Anyone else seeing CI failures caused by FlaggedRevsCacheTest on cherry-picks to -wmf.19? 
[11:50:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy4004.ulsfo.wmnet with OS trixie [11:50:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy4004.ulsfo.wmnet [11:50:38] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [11:51:04] I fixed the failure by cherry picking https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1249988 to -wmf.19 and depending on it [11:51:26] (03CR) 10Muehlenhoff: "Looks good, few nits inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [11:51:28] I'm just wondering if anyone else has fixed it in a different way [11:51:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [11:51:53] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [11:52:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1200) [12:07:36] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1013.eqiad.wmnet with reason: host reimage [12:13:27] (03Abandoned) 10Daniel Kinzler: Revert: rest-gateway: enable Mar2026 policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250679 (https://phabricator.wikimedia.org/T417778) (owner: 10Daniel Kinzler) [12:14:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1013.eqiad.wmnet with reason: host reimage [12:14:30] !log installing wireshark security updates [12:14:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:01] (03PS1) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [12:15:09] (03PS2) 10Daniel Kinzler: rest gateway: add second Lua filter for header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) [12:17:13] (03CR) 10CI reject: [V:04-1] role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [12:18:50] (03PS22) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:19:09] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1251061 (owner: 10L10n-bot) [12:21:47] (03PS23) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:25:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4004.ulsfo.wmnet [12:28:02] !log installing postgresql-17 security updates [12:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4004.ulsfo.wmnet [12:30:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum4004.ulsfo.wmnet [12:31:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:33:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:33:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [12:35:38] (03PS1) 10Btullis: Revert "Temporarily set dse-k8s-worker1013 into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1251071 [12:39:44] (03PS1) 10Btullis: Create analytics-fr-tech system user and corresponding group [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) [12:40:32] (03CR) 10CI reject: [V:04-1] Create analytics-fr-tech system user and corresponding group [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [12:42:19] (03CR) 10Btullis: [C:03+2] Revert "Temporarily set dse-k8s-worker1013 into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1251071 (owner: 10Btullis) [12:45:07] (03CR) 10Btullis: [C:03+2] topic: mw-page-edit-type-enrich-next [puppet] - 10https://gerrit.wikimedia.org/r/1249957 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [12:47:49] (03PS1) 10Daniel Kinzler: rest gateway: configure known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251074 [12:49:10] !log dpogorzelski@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [12:49:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1013.eqiad.wmnet [12:49:34] !log dpogorzelski@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [12:49:59] (03CR) 10Muehlenhoff: role::mediawiki::memcached::wikifunctions: add new role (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [12:50:02] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-admins for 
Jerrywang - https://phabricator.wikimedia.org/T419820#11701847 (10Aklapper) @JerryWang-WMF: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org... [12:51:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11701857 (10Jclark-ctr) [12:53:59] (03CR) 10Kamila Součková: [C:03+1] rest gateway: configure known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251074 (owner: 10Daniel Kinzler) [12:54:06] (03CR) 10Majavah: "As I said on the task, IMHO this is not necessary anymore as the relevant bug has been fixed on the libera.chat side" [puppet] - 10https://gerrit.wikimedia.org/r/1249506 (https://phabricator.wikimedia.org/T419190) (owner: 10Voidwalker) [12:55:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet [12:55:37] (03PS1) 10Muehlenhoff: Add tcp-proxy4003/tcp-proxy4004 [puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) [12:57:28] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1159.eqiad.wmnet [12:57:57] (03PS6) 10AKhatun: stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) [12:59:04] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: configure known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251074 (owner: 10Daniel Kinzler) [13:00:05] Urbanecm and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1300). Please do the needful. [13:00:05] katherine_g, edsanders, phuedx, and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:13] o/ [13:00:16] o/ [13:00:57] I can start with mine if thats good with you all? [13:01:09] (03Merged) 10jenkins-bot: rest gateway: configure known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251074 (owner: 10Daniel Kinzler) [13:01:16] 👍 [13:01:17] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11701923 (10Jclark-ctr) the Fundraising Server Provision for Dual-Switch Uplinks script in netbox is it setup to only look for old codfw currently reached out to Cathal by... [13:01:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy2002 using scap backport" [extensions/AutoModerator] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250656 (https://phabricator.wikimedia.org/T419718) (owner: 10Kgraessle) [13:02:18] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:02:19] (03PS2) 10Xcollazo: Disable rsync access for two dead dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1250070 (https://phabricator.wikimedia.org/T415193) [13:02:25] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:25] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:02:33] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1250070 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo) [13:02:33] (03CR) 10Matthias Mullie: "recheck" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251035 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:02:35] !log 
daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:02:39] (03CR) 10Matthias Mullie: "recheck" [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251034 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:03:29] o/ [13:03:56] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:05:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1159.eqiad.wmnet [13:06:17] (03Merged) 10jenkins-bot: Add multilingual revert risk host header for LiftWing requests [extensions/AutoModerator] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250656 (https://phabricator.wikimedia.org/T419718) (owner: 10Kgraessle) [13:06:23] (03CR) 10Xcollazo: [C:03+1] "PPC run looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1250070 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo) [13:06:41] (03PS1) 10Muehlenhoff: Add library hint for psql 17 [puppet] - 10https://gerrit.wikimedia.org/r/1251076 [13:06:54] (03PS2) 10Muehlenhoff: Add library hint for psql 17 [puppet] - 10https://gerrit.wikimedia.org/r/1251076 [13:07:18] (03CR) 10Elukey: "@dpogorzelski@wikimedia.org @tklausmann@wikimedia.org could you please add tests as described in https://wikitech.wikimedia.org/wiki/Logst" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:07:19] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1250656|Add multilingual revert risk host header for LiftWing requests (T419718)]] [13:07:23] T419718: Add multilingual revert risk host header for LiftWing requests - https://phabricator.wikimedia.org/T419718 [13:08:46] (03CR) 10Btullis: [C:03+2] Disable rsync access for two dead dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1250070 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo) [13:10:12] (03CR) 
10Muehlenhoff: [C:03+2] Add library hint for psql 17 [puppet] - 10https://gerrit.wikimedia.org/r/1251076 (owner: 10Muehlenhoff) [13:10:17] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:11:51] !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1250656|Add multilingual revert risk host header for LiftWing requests (T419718)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:12:14] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:13:51] (03CR) 10JHathaway: [C:03+1] site: check if role is defined instead of false [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [13:14:00] !log kgraessle@deploy2002 kgraessle: Continuing with sync [13:14:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1178.eqiad.wmnet [13:14:03] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database kaiwiki (T414240) [13:14:10] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database kaiwiki (T414240) [13:14:11] T414240: [wikireplicas] Create views for new wiki kaiwiki - https://phabricator.wikimedia.org/T414240 [13:14:44] (03CR) 10Bking: [C:03+2] wdqs: allow NFS mount from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/1250683 (https://phabricator.wikimedia.org/T415073) (owner: 10Bking) [13:14:47] (03CR) 10Dpogorzelski: [C:03+1] "not at the moment, if this is urgent please merge it to stop the flood :)" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:18:11] !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250656|Add multilingual revert risk host header for LiftWing requests (T419718)]] (duration: 10m 52s) [13:18:14] T419718: Add multilingual revert risk host header for LiftWing requests - 
https://phabricator.wikimedia.org/T419718 [13:18:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251005 (https://phabricator.wikimedia.org/T415320) (owner: 10Esanders) [13:18:41] edsanders: finished, over to you [13:18:48] thanks [13:19:03] (03CR) 10Muehlenhoff: [C:03+2] varnish: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247953 (owner: 10Muehlenhoff) [13:19:38] (03Merged) 10jenkins-bot: Deploy EditCheck suggestion mode at all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251005 (https://phabricator.wikimedia.org/T415320) (owner: 10Esanders) [13:20:06] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1251005|Deploy EditCheck suggestion mode at all Wikipedias (T415320)]] [13:20:10] (03PS2) 10Muehlenhoff: matomo: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243820 [13:20:14] T415320: [MILESTONE] Deploy Suggestion Mode MVP as a beta feature (all Wikipedias) - https://phabricator.wikimedia.org/T415320 [13:21:24] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:22:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1178.eqiad.wmnet [13:22:16] !log esanders@deploy2002 esanders: Backport for [[gerrit:1251005|Deploy EditCheck suggestion mode at all Wikipedias (T415320)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:22:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11702059 (10VRiley-WMF) Recommended changing the tenant to "Fundraising Tech". Once that was set in netbox, the scripts could see those units. [13:22:55] !log esanders@deploy2002 esanders: Continuing with sync [13:23:40] (03CR) 10Arnaudb: [C:03+1] "looks good to me, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [13:24:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:55] (03PS1) 10Muehlenhoff: Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) [13:25:23] (03CR) 10Muehlenhoff: "Good catch! I made a new patch to remove it entirely: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1251080" [puppet] - 10https://gerrit.wikimedia.org/r/1243824 (owner: 10Muehlenhoff) [13:25:28] (03PS2) 10Muehlenhoff: Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) [13:26:15] (03CR) 10JHathaway: [C:03+1] Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: 10Muehlenhoff) [13:26:48] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251005|Deploy EditCheck suggestion mode at all Wikipedias (T415320)]] (duration: 06m 42s) [13:26:52] T415320: [MILESTONE] Deploy Suggestion Mode MVP as a beta feature (all Wikipedias) - https://phabricator.wikimedia.org/T415320 [13:27:01] All done [13:27:20] (03CR) 10CI reject: [V:04-1] Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: 10Muehlenhoff) [13:27:27] @phuedx you're going next? 
[13:27:38] matthiasmullie: I believe yours depends on mine :) [13:28:11] So yeah [13:28:14] ^^ [13:28:35] (03CR) 10JavierMonton: [C:03+1] stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [13:28:37] The experiment isn't running yet so we could go either way, but yeah, respecting the correct order of things would be preferred :) [13:29:13] (03CR) 10AKhatun: [C:03+2] stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [13:30:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251031 (owner: 10Phuedx) [13:30:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251048 (https://phabricator.wikimedia.org/T419539) (owner: 10Phuedx) [13:30:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 (owner: 10Phuedx) [13:31:10] (03Merged) 10jenkins-bot: stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [13:31:20] (03PS1) 10Muehlenhoff: varnish: Run spec tests on Bullseye and Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1251081 [13:32:29] (03Merged) 10jenkins-bot: ext.testKitchen: Depend on mediawiki.user module [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251031 (owner: 10Phuedx) [13:32:45] (03CR) 10David Caro: [V:03+1 C:03+2] site: check if role is defined instead of false [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David 
Caro) [13:34:28] (03Merged) 10jenkins-bot: Add title to the request context in FlaggedRevsCacheTest [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251048 (https://phabricator.wikimedia.org/T419539) (owner: 10Phuedx) [13:34:35] (03CR) 10David Caro: [V:03+1 C:03+2] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1251012 (owner: 10David Caro) [13:37:17] (03Merged) 10jenkins-bot: ext.testKitchen: Depend on mediawiki.user module [extensions/TestKitchen] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251032 (owner: 10Phuedx) [13:37:52] (03PS2) 10Ebernhardson: opensearch-semantic-search: Increse memory quota to 650G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250686 (https://phabricator.wikimedia.org/T414091) [13:37:53] (03PS2) 10Ebernhardson: opensearch-semantic-search: Scale for additional wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250687 (https://phabricator.wikimedia.org/T414091) [13:37:53] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1251031|ext.testKitchen: Depend on mediawiki.user module]], [[gerrit:1251048|Add title to the request context in FlaggedRevsCacheTest (T419539)]], [[gerrit:1251032|ext.testKitchen: Depend on mediawiki.user module]] [13:37:57] T419539: Changes in FlaggedRevs seem to be causing CI failures for CheckUser - https://phabricator.wikimedia.org/T419539 [13:39:55] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1251031|ext.testKitchen: Depend on mediawiki.user module]], [[gerrit:1251048|Add title to the request context in FlaggedRevsCacheTest (T419539)]], [[gerrit:1251032|ext.testKitchen: Depend on mediawiki.user module]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:41:54] I did quick browse of enwiki, dewiki, and frwiki. No errors in the browser console. Continuing... 
[13:42:01] !log phuedx@deploy2002 phuedx: Continuing with sync [13:44:51] (03CR) 10Brouberol: [C:03+2] Update dse-k8s airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251022 (https://phabricator.wikimedia.org/T419540) (owner: 10Joal) [13:45:42] (03CR) 10Muehlenhoff: [C:03+1] "Thanks! Looks good, merging." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1250619 (https://phabricator.wikimedia.org/T418253) (owner: 10Thcipriani) [13:45:49] (03CR) 10Muehlenhoff: [C:03+2] Blubber: bump blubber to 1.8.1; set setuptools version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1250619 (https://phabricator.wikimedia.org/T418253) (owner: 10Thcipriani) [13:45:55] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251031|ext.testKitchen: Depend on mediawiki.user module]], [[gerrit:1251048|Add title to the request context in FlaggedRevsCacheTest (T419539)]], [[gerrit:1251032|ext.testKitchen: Depend on mediawiki.user module]] (duration: 08m 01s) [13:45:59] T419539: Changes in FlaggedRevs seem to be causing CI failures for CheckUser - https://phabricator.wikimedia.org/T419539 [13:46:31] matthiasmullie: Over to you [13:46:59] Thanks! 
[13:47:40] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1251081 (owner: 10Muehlenhoff) [13:48:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251034 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:48:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251035 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:48:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251036 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:48:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251037 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:51:05] (03CR) 10Arnaudb: [C:03+1] "thanks for the fix @jwodstrcil@wikimedia.org, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1250601 (https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [13:52:27] (03Merged) 10jenkins-bot: Remove queueing logic [extensions/ReaderExperiments] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251036 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:52:40] (03PS2) 10Btullis: Create analytics-fr-tech system user and corresponding group [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) [13:52:57] (03CR) 10Bking: [C:03+2] opensearch-semantic-search: Increse memory quota to 650G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250686 (https://phabricator.wikimedia.org/T414091) (owner: 
10Ebernhardson) [13:53:22] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11702186 (10Jclark-ctr) a:03Jclark-ctr [13:53:28] (03CR) 10CI reject: [V:04-1] Create analytics-fr-tech system user and corresponding group [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [13:53:41] (03Merged) 10jenkins-bot: Remove queueing logic [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251037 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [13:54:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:54:51] !log installing libssh security updates [13:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:34] (03CR) 10Muehlenhoff: [C:03+2] varnish: Run spec tests on Bullseye and Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1251081 (owner: 10Muehlenhoff) [13:56:24] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11702215 (10Jclark-ctr) [13:56:51] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11702219 (10Jclark-ctr) a:03Jclark-ctr [13:57:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:57:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11702225 (10Jclark-ctr) a:03Jclark-ctr [13:57:43] (03PS1) 10Gergő Tisza: Set 'sub' JWT field in client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251087 (https://phabricator.wikimedia.org/T417278) [13:58:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/OAuth] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251087 (https://phabricator.wikimedia.org/T417278) (owner: 10Gergő Tisza) [13:58:50] (03PS1) 10Gergő Tisza: Set 'sub' JWT field in client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251088 (https://phabricator.wikimedia.org/T417278) [13:59:00] (03PS1) 10Muehlenhoff: cfssl: Run tests on Bullseye and Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1251089 [13:59:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251088 (https://phabricator.wikimedia.org/T417278) (owner: 10Gergő Tisza) [13:59:15] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:59:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11702238 (10Jclark-ctr) a:05Jhancock.wm→03None [13:59:40] (03PS1) 10Muehlenhoff: civicrm: Run test on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1251090 [13:59:48] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:59:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11702240 (10Jclark-ctr) a:03Jclark-ctr [14:00:07] (03PS3) 10Muehlenhoff: Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) [14:00:23] (03CR) 10CI reject: [V:04-1] civicrm: Run test on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1251090 (owner: 10Muehlenhoff) [14:01:57] (03Merged) 10jenkins-bot: Update CSS selector for Mobile TOC button [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251035 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [14:02:00] (03Merged) 10jenkins-bot: Update CSS selector for Mobile TOC button [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1251034 (https://phabricator.wikimedia.org/T419587) (owner: 10Matthias Mullie) [14:02:30] (03CR) 10CI reject: [V:04-1] Set 'sub' JWT field in client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251088 (https://phabricator.wikimedia.org/T417278) (owner: 10Gergő Tisza) [14:02:35] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1251034|Update CSS selector for Mobile TOC button (T419587)]], [[gerrit:1251035|Update CSS selector for Mobile TOC button (T419587)]], [[gerrit:1251036|Remove queueing logic (T419587)]], [[gerrit:1251037|Remove queueing logic (T419587)]] [14:02:44] T419587: Fix TestKitchen dependency-order issue affecting data collection - https://phabricator.wikimedia.org/T419587 [14:03:23] !log start eqiad rack D2 depools [14:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:29] https://phabricator.wikimedia.org/T419647 [14:04:03] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] 
- 10https://gerrit.wikimedia.org/r/1251091 [14:04:37] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1251034|Update CSS selector for Mobile TOC button (T419587)]], [[gerrit:1251035|Update CSS selector for Mobile TOC button (T419587)]], [[gerrit:1251036|Remove queueing logic (T419587)]], [[gerrit:1251037|Remove queueing logic (T419587)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:04:40] (03PS2) 10Muehlenhoff: civicrm: Run test on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1251090 [14:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:05:53] !log mlitn@deploy2002 mlitn: Continuing with sync [14:07:16] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:07:36] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:08:50] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11702312 (10Jclark-ctr) @Clement_Goubert Could you update site.pp it is missing. Also add these to preseed for efi booting also Thanks! 
[14:08:56] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [14:09:03] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11702314 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS trixie executed with errors: - cloudcephosd20... [14:09:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11702315 (10Jclark-ctr) [14:09:50] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251034|Update CSS selector for Mobile TOC button (T419587)]], [[gerrit:1251035|Update CSS selector for Mobile TOC button (T419587)]], [[gerrit:1251036|Remove queueing logic (T419587)]], [[gerrit:1251037|Remove queueing logic (T419587)]] (duration: 07m 15s) [14:09:54] T419587: Fix TestKitchen dependency-order issue affecting data collection - https://phabricator.wikimedia.org/T419587 [14:09:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: 10Muehlenhoff) [14:10:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11702339 (10Jclark-ctr) @BTullis could you update preseed filed for these servers for efi booting [14:11:10] (03PS2) 10Muehlenhoff: cloudceph: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1247033 [14:11:34] All done [14:11:34] (03PS3) 10Btullis: Create analytics-fr-tech system user and corresponding group [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) [14:12:08] phuedx: thanks again for the speedy fix/backport on your 
end! [14:12:09] !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet [14:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:15:16] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 24 hosts with reason: Switch BGP bounce [14:15:29] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11702362 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf71bad1-aeb4-4596-b577-d88e4e171aab) set by ayounsi@cumin1003 for 0:30:00 on 24 host(s) and their servi... [14:15:55] (03PS3) 10Ebernhardson: opensearch-semantic-search: Scale for additional wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250687 (https://phabricator.wikimedia.org/T414091) [14:19:55] (03CR) 10Gergő Tisza: "recheck" [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251088 (https://phabricator.wikimedia.org/T417278) (owner: 10Gergő Tisza) [14:19:57] (03CR) 10Ebernhardson: [C:03+2] opensearch-semantic-search: Scale for additional wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250687 (https://phabricator.wikimedia.org/T414091) (owner: 10Ebernhardson) [14:20:14] !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet [14:21:59] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11702397 (10ayounsi) BGP bounce done by running those 2 commands "at the same time": ` tools network-instance default protocols bgp neighbor 10.64.128.17 reset-peer tools network-in... 
[14:22:12] (03Merged) 10jenkins-bot: opensearch-semantic-search: Scale for additional wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250687 (https://phabricator.wikimedia.org/T414091) (owner: 10Ebernhardson) [14:23:20] (03PS1) 10Arnaudb: gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) [14:23:20] (03CR) 10Arnaudb: "1240197 had a messy rebase, I've created that CR instead." [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:24:25] (03CR) 10Jelto: "one comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [14:24:58] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:25:32] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:27:01] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11702422 (10ayounsi) 05Open→03Resolved All servers have been repooled. [14:27:32] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11702426 (10Jelto) Adding #collaboration-services for the tcp-proxy hosts. When the hosts are ready and running the puppet role a conf... 
[14:29:16] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: fix failing discovery dns lookup in test spec [puppet] - 10https://gerrit.wikimedia.org/r/1250601 (https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [14:29:58] FIRING: KubernetesCalicoDown: dse-k8s-worker1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1028.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1430) [14:31:13] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1018 [14:31:21] 10ops-codfw, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11702440 (10JMeybohm) Hey #ops-codfw can you think of any work that could explain the chassis intrusion/power supply messages? 
[14:31:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1018 [14:31:44] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [14:33:10] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:33:55] PROBLEM - Host dse-k8s-worker1018 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:04] (03PS1) 10Trueg: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) [14:34:48] !log andrew@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:37:21] (03CR) 10Vgutierrez: [C:03+1] hcaptcha: Enable nginx caching for secure-api.js [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan) [14:38:39] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:39:11] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419854 (10tappof) 03NEW [14:39:37] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419855 (10tappof) 03NEW [14:39:53] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419856 (10tappof) 03NEW [14:40:23] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419857 (10tappof) 03NEW [14:40:36] 
07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419858 (10tappof) 03NEW [14:40:54] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419859 (10tappof) 03NEW [14:41:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:42:04] (03CR) 10Vgutierrez: [C:03+1] gerrit: sync httpd config to ATS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:43:00] (03PS1) 10DDesouza: Deploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251098 (https://phabricator.wikimedia.org/T419778) [14:43:31] (03PS1) 10Ayounsi: decom cookbook: add --homer parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/1251099 [14:43:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251098 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [14:44:44] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:44:47] 07sre-alert-triage, 06Data-Persistence: Alert in need of triage: SmartNotHealthy (instance aqs1015:9100) - https://phabricator.wikimedia.org/T419861 (10tappof) 03NEW [14:44:56] (03CR) 10Cwhite: [C:03+2] validator: add note about dot-delimited root fields [software/ecs] - 10https://gerrit.wikimedia.org/r/1245467 (owner: 10Cwhite) [14:45:00] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:45:01] (03PS2) 10Arnaudb: gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) [14:45:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:35] (03Merged) 10jenkins-bot: validator: add note about dot-delimited root fields [software/ecs] - 10https://gerrit.wikimedia.org/r/1245467 (owner: 10Cwhite) [14:45:44] (03CR) 10Arnaudb: gerrit: sync httpd config to ATS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:45:49] (03PS2) 10Muehlenhoff: Add tcp-proxy4003/tcp-proxy4004 [puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) [14:46:09] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:46:16] (03CR) 10Muehlenhoff: Add tcp-proxy4003/tcp-proxy4004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [14:49:05] (03CR) 10CI reject: [V:04-1] decom cookbook: add --homer parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/1251099 (owner: 10Ayounsi) [14:50:14] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11702595 (10MoritzMuehlenhoff) >>! In T418993#11702426, @Jelto wrote: > When the hosts are ready and running the puppet role a conftool... 
[14:50:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:02] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:51:02] (03PS2) 10Ayounsi: decom cookbook: add --homer parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/1251099 [14:53:58] (03PS2) 10Scott French: aptrepo: Temporarily remove php83-icu72 and references [puppet] - 10https://gerrit.wikimedia.org/r/1251101 (https://phabricator.wikimedia.org/T419058) [14:56:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1251101 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:56:14] (03CR) 10CI reject: [V:04-1] decom cookbook: add --homer parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/1251099 (owner: 10Ayounsi) [14:56:30] (03CR) 10Vgutierrez: gerrit: sync httpd config to ATS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:56:53] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:57:34] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:57:48] (03PS3) 10Arnaudb: gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) [14:58:35] (03CR) 10Scott French: "Thanks, Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251101 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:58:37] (03CR) 10Scott French: [C:03+2] aptrepo: Temporarily remove php83-icu72 and references [puppet] - 10https://gerrit.wikimedia.org/r/1251101 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:59:00] (03CR) 10Arnaudb: gerrit: sync httpd config to ATS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:59:03] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:59:46] (03CR) 10CI reject: [V:04-1] gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:59:50] (03CR) 10Jelto: [C:03+1] "lgtm now, I'm not sure if this will trigger some alerts when conftool is updates and puppet might not be finished on the new tcp-proxy hos" [puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:00:09] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [15:00:18] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [15:00:35] (03CR) 10Arnaudb: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [15:01:20] (03CR) 10Muehlenhoff: "That's fine, conftool will initially add these as inactive, we need to enable them via conftool explicitly." 
[puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:01:22] (03CR) 10Muehlenhoff: [C:03+2] Add tcp-proxy4003/tcp-proxy4004 [puppet] - 10https://gerrit.wikimedia.org/r/1251075 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:03:54] (03CR) 10Vgutierrez: [C:03+1] gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [15:04:34] (03PS1) 10Scott French: Revert "aptrepo: Temporarily remove php83-icu72 and references" [puppet] - 10https://gerrit.wikimedia.org/r/1251105 (https://phabricator.wikimedia.org/T419058) [15:05:15] (03CR) 10Bking: [C:03+1] dse-k8s-eqiad: wdqs-queryhammer namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251040 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:05:19] (03CR) 10C. Scott Ananian: [C:04-1] ""parsoidrendered" isn't the right category, as most of our opt outs are on enwiki which is not (yet) in the parsoidrendered category." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) (owner: 10Jdlrobson) [15:05:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250750 (https://phabricator.wikimedia.org/T414852) (owner: 10C. 
Scott Ananian) [15:07:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1251105 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [15:08:02] (03CR) 10Scott French: [C:03+2] Revert "aptrepo: Temporarily remove php83-icu72 and references" [puppet] - 10https://gerrit.wikimedia.org/r/1251105 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [15:09:01] (03PS2) 10Daimona Eaytoy: phpunit: Avoid unnecessary writes in generatePHPUnitConfig.php [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251106 (https://phabricator.wikimedia.org/T419107) [15:10:35] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11702738 (10malberts) Not sure where exactly to comment, but the commit that was backported to REL 1.43 for this issue, is calling a... [15:11:23] (03PS1) 10Kevin Bazira: ml-services: update embeddings-staging image to one that supports aiter OOTB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251108 (https://phabricator.wikimedia.org/T419650) [15:11:35] (03PS3) 10Ayounsi: decom cookbook: add --homer parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/1251099 [15:12:17] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=tcp-proxy4003.ulsfo.wmnet [15:12:30] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=tcp-proxy4003.ulsfo.wmnet [15:12:40] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: wdqs-queryhammer namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251040 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:13:03] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=tcp-proxy4004.ulsfo.wmnet [15:13:09] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=tcp-proxy4004.ulsfo.wmnet [15:13:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/admin 'apply'. [15:14:21] (03PS14) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) [15:14:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:17:26] 06SRE, 06Traffic: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868 (10MoritzMuehlenhoff) 03NEW [15:19:33] (03CR) 10Dzahn: [C:03+1] Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [15:19:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2037 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:19:42] !log reuploaded libxml2 2.9.10+dfsg-6.7+deb11u9+wmf11u1 and 72.1-3+deb12u1~wmf11u1 to component/php83-icu72 for bullseye-wikimedia T419058 [15:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:46] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [15:22:08] (03CR) 10Ozge: [C:03+1] ml-services: update embeddings-staging image to one that supports aiter OOTB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251108 (https://phabricator.wikimedia.org/T419650) (owner: 10Kevin Bazira) [15:24:24] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11702843 (10MoritzMuehlenhoff) [15:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2037 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:24:58] (03CR) 10Phuedx: Add stream config for attribution research (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250249 (https://phabricator.wikimedia.org/T417050) (owner: 10TChin) [15:25:05] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11702847 (10MoritzMuehlenhoff) [15:25:26] (03CR) 10JMeybohm: [C:04-1] kafka-mirrormaker: allow multiple releases to be installed in the same namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:26:58] !log ebernhardson@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:27:04] !log ebernhardson@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:30:11] (03PS1) 10Brouberol: dse-k8s-eqiad: add mw-page-edit-type-enrich-next to the flink tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251111 (https://phabricator.wikimedia.org/T351225) [15:30:57] (03CR) 10Lerickson: "Resolving this thread, thanks Daniel and David!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [15:31:27] (03CR) 10JavierMonton: [C:03+1] dse-k8s-eqiad: add mw-page-edit-type-enrich-next to the flink tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251111 (https://phabricator.wikimedia.org/T351225) (owner: 10Brouberol) [15:32:27] (03CR) 10AKhatun: [C:03+2] dse-k8s-eqiad: add mw-page-edit-type-enrich-next to the flink tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251111 (https://phabricator.wikimedia.org/T351225) (owner: 10Brouberol) [15:33:46] (03PS1) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [15:33:59] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:34:20] (03CR) 10CI reject: [V:04-1] mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:35:11] (03CR) 10Dzahn: [V:03+1 C:03+1] "compiled! looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251009 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [15:35:17] (03PS1) 10Cathal Mooney: BGP Groups: change dse-k8s group to dse_k8s [homer/public] - 10https://gerrit.wikimedia.org/r/1251114 (https://phabricator.wikimedia.org/T414787) [15:35:29] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:36:26] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11702933 (10RobH) Please note the maint window for this offline host is 2026-03-13 @ 07:00 AM Singapore / which is 5PM Thursday evening for me. I'll be online to remotely supervise the swap and attem... 
[15:36:39] !log andrew@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudgw2002-dev.codfw.wmnet [15:38:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11702946 (10Jhancock.wm) replaced it with an official warranty disk we had on hand. replacing that warranty disk with a new one from Dell. SR223804656 @MatthewVernon the disk in... [15:38:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11702947 (10Jhancock.wm) a:03Jhancock.wm [15:40:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:40:58] (03PS2) 10Clément Goubert: rest-gateway: Allow full query param matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [15:41:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:41:10] (03CR) 10Dwisehaupt: [C:03+1] civicrm: Run test on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1251090 (owner: 10Muehlenhoff) [15:41:24] (03PS2) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [15:41:58] (03CR) 10CI reject: [V:04-1] mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:43:35] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [15:43:56] (03PS3) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [15:44:31] (03CR) 10CI reject: [V:04-1] mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 
(https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:45:04] (03CR) 10JMeybohm: "I would not move stuff around as that makes looking at the diffs in CI and during deployment way more involved. But if-guarding the druid " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [15:45:06] (03PS1) 10Bking: dse-k8s: Add CFSSL profile for longer-lived certificates (6 mo). [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) [15:45:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878 (10Ben.buchenau) 03NEW [15:46:31] (03PS3) 10Clément Goubert: rest-gateway: Allow full query param matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [15:47:10] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [15:47:21] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703023 (10Jhancock.wm) there was a loose power cable earlier this week. it might have powered off during that time. power cables have been secured since.... 
[15:47:24] (03CR) 10Muehlenhoff: [C:03+2] civicrm: Run test on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1251090 (owner: 10Muehlenhoff) [15:47:33] (03CR) 10Clément Goubert: rest-gateway: Allow full query param matching (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [15:48:54] (03PS4) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [15:50:15] andrew@cumin2002 decommission (PID 1859017) is awaiting input [15:50:54] (03CR) 10CI reject: [V:04-1] mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:51:18] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11703059 (10MoritzMuehlenhoff) [15:51:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11703061 (10Ben.buchenau) Thanks for all the clarification! In that case, being added to `analytics-wmde-users` sounds like the best way to go... 
[15:52:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [15:52:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:52:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudgw2002-dev.codfw.wmnet [15:53:18] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11703088 (10Andrew) a:05Andrew→03None [15:53:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11703092 (10Dzahn) 05Open→03Resolved [15:54:16] (03CR) 10VolkerE: [C:03+1] Enable personal main menu to all users in Minerva Neue skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [15:54:21] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new, 07Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11703095 (10elukey) After a chat with Janis I partially solved the mystery that we were discussing... [15:54:37] (03PS1) 10Cathal Mooney: BGP Policy: Add dse_k8s policy in Nokia format [homer/public] - 10https://gerrit.wikimedia.org/r/1251118 (https://phabricator.wikimedia.org/T414787) [15:54:44] (03CR) 10Andrew Bogott: [C:03+1] "This got me past the blocker I was hitting. 
Note that I still had to specify --homer on the commandline; not sure if that was intended or " [cookbooks] - 10https://gerrit.wikimedia.org/r/1251099 (owner: 10Ayounsi) [15:54:53] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703098 (10Scott_French) @Jhancock.wm - Ah, thanks for highlighting that! So, it sounds like the host may have lost power briefly when the power cables we... [15:55:47] (03CR) 10Andrew Bogott: [C:03+2] cloudidp2001-dev: force to puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206932 (owner: 10Andrew Bogott) [15:56:47] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [15:56:54] (03Abandoned) 10Andrew Bogott: cloud-vps dynamic proxy: prometheus stats from nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) (owner: 10Andrew Bogott) [15:57:20] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update embeddings-staging image to one that supports aiter OOTB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251108 (https://phabricator.wikimedia.org/T419650) (owner: 10Kevin Bazira) [15:57:30] (03Abandoned) 10Andrew Bogott: puppetserver refactor: split out git service user profile from server [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [15:57:31] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [15:57:47] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:58:16] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703121 (10Jhancock.wm) yes, that's the most likely cause. 
[15:58:31] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:59:28] (03Merged) 10jenkins-bot: ml-services: update embeddings-staging image to one that supports aiter OOTB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251108 (https://phabricator.wikimedia.org/T419650) (owner: 10Kevin Bazira) [15:59:33] (03PS5) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [15:59:49] (03PS1) 10C. Scott Ananian: Revert "Enables legacy processing in ParserOutputPostCacheTransform when cached" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251121 [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:06] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [16:01:20] (03CR) 10Elukey: [C:03+1] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1251091 (owner: 10Muehlenhoff) [16:01:45] (03CR) 10Elukey: [C:03+1] cfssl: Run tests on Bullseye and Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1251089 (owner: 10Muehlenhoff) [16:01:58] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:02:23] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[16:02:23] (03CR) 10Cathal Mooney: [C:03+2] BGP Groups: change dse-k8s group to dse_k8s [homer/public] - 10https://gerrit.wikimedia.org/r/1251114 (https://phabricator.wikimedia.org/T414787) (owner: 10Cathal Mooney) [16:02:39] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [16:02:55] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [16:03:47] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [16:03:54] (03Merged) 10jenkins-bot: BGP Groups: change dse-k8s group to dse_k8s [homer/public] - 10https://gerrit.wikimedia.org/r/1251114 (https://phabricator.wikimedia.org/T414787) (owner: 10Cathal Mooney) [16:04:01] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [16:04:04] (03CR) 10Cathal Mooney: [C:03+2] BGP Policy: Add dse_k8s policy in Nokia format [homer/public] - 10https://gerrit.wikimedia.org/r/1251118 (https://phabricator.wikimedia.org/T414787) (owner: 10Cathal Mooney) [16:04:44] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [16:04:55] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [16:05:47] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [16:05:54] (03Merged) 10jenkins-bot: BGP Policy: Add dse_k8s policy in Nokia format [homer/public] - 10https://gerrit.wikimedia.org/r/1251118 (https://phabricator.wikimedia.org/T414787) (owner: 10Cathal Mooney) [16:06:39] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [16:07:18] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [16:07:18] (03CR) 10Brouberol: [C:03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [16:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [16:08:54] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:09:36] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:09:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [16:09:48] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [16:10:32] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [16:11:15] !log joal@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [16:11:58] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884 (10RobH) 03NEW p:05Triage→03Medium [16:12:01] !log joal@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [16:13:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11703185 (10MatthewVernon) Thanks! New disk is configured and backfilling fine. 
[16:16:06] RECOVERY - Host dse-k8s-worker1028 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:16:22] (03PS1) 10Brouberol: flink-operator: annotate pods with the configmap checksum [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251128 [16:18:39] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:19:05] (03PS6) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [16:19:10] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:19:37] !log dzahn@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:19:56] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703207 (10Scott_French) Great, thanks @Jhancock.wm. Also good to know that the intrusion event seems to be something about the chassis of this particular... 
[16:19:56] !log dzahn@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:20:14] !log dzahn@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:20:29] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11703212 (10RobH) [16:20:35] !log dzahn@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:21:01] cp4 [16:21:33] (03PS2) 10Dzahn: switch status.wikimedia.org from rackspace to wikimedia [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) [16:21:38] (03CR) 10Dzahn: [C:03+1] switch status.wikimedia.org from rackspace to wikimedia [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [16:22:11] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11703215 (10RobH) After spending about 30 minutes on the Dell site I'm not locating the usual whitepapers the Dell Team sent us back when we selected the SSDs years and years ago, so I've asked... 
[16:24:03] (03CR) 10Dzahn: [C:03+1] "[deploy1003:~] $ curl --resolve status.wikimedia.org:30443:$(dig +short k8s-ingress-staging.discovery.wmnet) https://status.wikimedia.org:" [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [16:24:12] (03CR) 10Dillon: [C:03+1] PersonalDashboard edit count configurations should have an upper bound and limit personal tools menu access too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251023 (https://phabricator.wikimedia.org/T418365) (owner: 10Kgraessle) [16:24:50] (03CR) 10Dzahn: [C:03+2] switch status.wikimedia.org from rackspace to wikimedia [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [16:25:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS trixie [16:25:51] !log switching old status.wikimedia.org page away from rackspace T414098 [16:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:55] T414098: Move https://status.wikimedia.org/ away from rackspace - https://phabricator.wikimedia.org/T414098 [16:26:00] !log dzahn@dns1004 START - running authdns-update [16:26:09] (03PS7) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [16:26:11] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:27:35] !log dzahn@dns1004 END - running authdns-update [16:27:48] (03PS2) 10Brouberol: flink-operator: annotate pods with the configmap checksum [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251128 [16:28:35] (03PS8) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 
(https://phabricator.wikimedia.org/T410020) [16:28:53] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:29:04] (03PS1) 10AKhatun: stream: remove unwanted params in edit-type stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251130 (https://phabricator.wikimedia.org/T351225) [16:29:22] (03PS1) 10Scott French: Revert "mw-(api-int|web): Pilot drain configuration in canary" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251131 (https://phabricator.wikimedia.org/T364245) [16:29:46] (03CR) 10Dzahn: [C:03+2] "curl --resolve status.wikimedia.org:443:$(dig +short dyna.wikimedia.org) https://status.wikimedia.org:443" [dns] - 10https://gerrit.wikimedia.org/r/1240414 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [16:31:37] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:31:37] (03CR) 10JavierMonton: [C:03+1] stream: remove unwanted params in edit-type stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251130 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [16:32:23] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:33:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:33:12] (03PS9) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [16:33:17] (03CR) 10Jcrespo: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:33:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:33:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:45] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:33:45] (03CR) 10CI reject: [V:04-1] mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:35:21] (03CR) 10AKhatun: [C:03+2] stream: remove unwanted params in edit-type stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251130 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [16:36:16] !log reprepro include php8.3_8.3.30-1+wmf11u2+icu72u1 into component/php83-icu72 - T419058 [16:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:21] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [16:37:34] (03Merged) 10jenkins-bot: stream: remove unwanted params in edit-type stream [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1251130 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [16:37:49] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie [16:37:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11703264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie [16:38:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:40:48] (03CR) 10Btullis: [C:03+1] flink-operator: annotate pods with the configmap checksum [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251128 (owner: 10Brouberol) [16:40:58] !log reprepro include php-defaults_94+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [16:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:00] (03CR) 10Brouberol: [C:03+2] flink-operator: annotate pods with the configmap checksum [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251128 (owner: 10Brouberol) [16:42:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [16:42:55] (03CR) 10Btullis: [C:03+2] Create analytics-fr-tech system user and corresponding group [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [16:43:23] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:43:41] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11703280 (10WMDE-leszek) I approve 
this request on WMDE's end. Thank you [16:43:43] !log reprepro include dh-php_5.5+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [16:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:47] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [16:43:48] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [16:43:53] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:44:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:45:25] (03PS3) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [16:45:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:45:42] (03CR) 10Dzahn: [C:03+2] microsites: add monitoring for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240417 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [16:45:47] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [16:45:48] (03PS2) 10Dzahn: microsites: add monitoring for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240417 (https://phabricator.wikimedia.org/T414098) [16:46:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:47:49] (03CR) 10Elukey: "I added it myself so we can unblock this. 
@cwhite@wikimedia.org we are ready to merge if you like the patch, it should drop some spam :)" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [16:48:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:48:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:48:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:48:44] (03CR) 10Elukey: [C:04-1] profile::logstash: drop kserve-controller's logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [16:48:57] (03CR) 10Jforrester: [C:03+1] phpunit: Avoid unnecessary writes in generatePHPUnitConfig.php [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251106 (https://phabricator.wikimedia.org/T419107) (owner: 10Daimona Eaytoy) [16:50:02] btullis: creating the group breaks puppet [16:50:11] group '935' does not exist [16:50:19] !log reprepro include php-apcu_5.1.24-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [16:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:22] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [16:50:56] (03PS1) 10Vgutierrez: Revert "Create analytics-fr-tech system user and 
corresponding group" [puppet] - 10https://gerrit.wikimedia.org/r/1251135 [16:51:21] (03PS4) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [16:51:33] (03CR) 10Dzahn: "there is a typo here, the GID is 953 in one place but 935 in another - puppet breaks because "group 935 does not exist"" [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [16:51:45] FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:52:07] (03PS5) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [16:52:08] (03CR) 10Vgutierrez: [C:03+2] Revert "Create analytics-fr-tech system user and corresponding group" [puppet] - 10https://gerrit.wikimedia.org/r/1251135 (owner: 10Vgutierrez) [16:52:25] (03CR) 10Elukey: "Now we are good :)" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [16:53:07] (03CR) 10Dzahn: [C:03+1] Revert "Create analytics-fr-tech system user and corresponding group" [puppet] - 10https://gerrit.wikimedia.org/r/1251135 (owner: 10Vgutierrez) [16:54:03] (03CR) 10CI reject: [V:04-1] profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [16:54:58] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1004.eqiad.wmnet with reason: host reimage [16:56:07] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703336 
(10JMeybohm) >>! In T419747#11703207, @Scott_French wrote: > @JMeybohm - It looks like this host booted quickly enough that pods weren't migrated... [16:56:16] (03PS6) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [16:57:20] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [reason: trixie reimaging] [16:57:25] (03PS1) 10SomeRandomDeveloper: EditPage: Re-add catch block for MWException [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251138 (https://phabricator.wikimedia.org/T419883) [16:57:48] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS trixie [16:57:52] (03CR) 10Dzahn: [C:03+2] microsites: add monitoring for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240417 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [16:58:02] !log reprepro include php-msgpack_3.0.0-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [16:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:06] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [16:58:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1004.eqiad.wmnet with reason: host reimage [16:58:25] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet [reason: trixie reimaging] [16:58:44] (03CR) 10CI reject: [V:04-1] profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [16:59:12] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS trixie [16:59:42] PROBLEM - Host backup2005 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:05] bd808: #bothumor I ♥
Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1700) [17:00:41] * bd808 checks for things to deploy [17:02:34] Nothing for me to push out today [17:03:01] (03Abandoned) 10C. Scott Ananian: Revert "Enables legacy processing in ParserOutputPostCacheTransform when cached" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251121 (owner: 10C. Scott Ananian) [17:03:22] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2027.* [17:03:42] (03PS7) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [17:04:06] (03CR) 10Elukey: "ok the assumption is that "expected" should be zero since the filter drops all the data, and local tests pass now." [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [17:05:17] (03PS1) 10C. 
Scott Ananian: Ensure that we always run ParserHooks::transformHtml() when using Parsoid [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251139 (https://phabricator.wikimedia.org/T419830) [17:05:56] (03PS1) 10GergesShamon: [arwikiquote] add namespace alias for NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251140 (https://phabricator.wikimedia.org/T419828) [17:06:16] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp202[89].* [17:10:19] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703417 (10RLazarus) a:03Scott_French [17:10:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251140 (https://phabricator.wikimedia.org/T419828) (owner: 10GergesShamon) [17:16:33] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade Kafka to version 3.x - https://phabricator.wikimedia.org/T416669#11703440 (10JMeybohm) [17:16:53] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [17:17:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [17:17:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup1004.eqiad.wmnet with OS trixie [17:17:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11703443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host 
ms-backup1004.eqiad.wmnet with OS trixie completed: - ms-backup1004... [17:17:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11703444 (10Jclark-ctr) 05Open→03Resolved [17:18:18] (03CR) 10Muehlenhoff: "All new sudo groups needs to be approved in the weekly SRE IF meeting, which didn't happen here." [puppet] - 10https://gerrit.wikimedia.org/r/1251072 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [17:18:22] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [17:20:03] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [17:20:15] brett@cumin2002 reimage (PID 1867619) is awaiting input [17:20:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [17:24:07] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [17:26:45] RESOLVED: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:27:16] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp203[0-5].* [17:27:34] jouncebot: nowandnext [17:27:34] For the next 0 hour(s) and 32 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1700) [17:27:34] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1700) [17:27:34] In 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1800) [17:28:21] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4045.ulsfo.wmnet with OS trixie [17:28:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS trixie [17:30:21] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [17:30:35] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11703463 (10MoritzMuehlenhoff) Wouldn't it make sense to directly move to trixie and skip bookworm? [17:31:07] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1018 [17:31:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1018 [17:32:54] (03PS1) 10Clément Goubert: Add new wikikube-worker137[3-4] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1251145 (https://phabricator.wikimedia.org/T416390) [17:33:22] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:37:10] btullis@cumin1003 provision (PID 3063301) is awaiting input [17:37:51] (03CR) 10Blake: [C:03+1] Add new wikikube-worker137[3-4] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1251145 (https://phabricator.wikimedia.org/T416390) (owner: 10Clément Goubert) [17:38:13] (03CR) 10Clément Goubert: [C:03+2] Add new wikikube-worker137[3-4] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1251145 (https://phabricator.wikimedia.org/T416390) (owner: 10Clément Goubert) [17:39:38] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11703516 
(10Clement_Goubert) >>! In T416390#11702311, @Jclark-ctr wrote: > @Clement_Goubert Could you update site.pp it is missing. Also add these to preseed for efi... [17:39:56] (03PS2) 10Cwhite: ncredir: add wikimediastatus.net funnel [puppet] - 10https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T419887) [17:40:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:41:19] (03PS1) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) [17:46:52] (03PS8) 10Cwhite: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [17:47:00] (03CR) 10Daniel Kinzler: rest-gateway: Allow full query param matching (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [17:47:45] (03PS1) 10Btullis: Temporarily puto dse-k8s-worker101[8-9] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1251147 (https://phabricator.wikimedia.org/T414787) [17:48:38] (03PS1) 10Daniel Kinzler: rest gateway: make no-limit policy bypass rate limits. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 [17:49:14] (03CR) 10Cwhite: [C:03+2] profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [17:49:23] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage [17:49:26] btullis@cumin1003 reimage (PID 3063774) is awaiting input [17:49:38] (03CR) 10Btullis: [C:03+2] Temporarily puto dse-k8s-worker101[8-9] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1251147 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [17:49:41] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS trixie [17:49:46] (03CR) 10Cwhite: [C:03+2] "Thanks so much!!" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [17:49:57] (03PS4) 10Clément Goubert: rest-gateway: exclude action API cspreport from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [17:50:07] (03CR) 10Clément Goubert: rest-gateway: exclude action API cspreport from rate limiting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [17:52:38] (03CR) 10Btullis: [C:03+2] Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [17:52:40] 06SRE, 10SRE-Access-Requests, 07OKR-Work, 13Patch-For-Review, 06Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11703572 (10BTullis) [17:52:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4045.ulsfo.wmnet with reason: host 
reimage [17:53:54] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [reason: trixie reimaging] [17:54:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [17:55:11] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp20(3[6-9]|4[012]).* [17:56:56] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4039.ulsfo.wmnet [reason: trixie reimaging] [17:57:08] (03PS3) 10Jsn.sherman: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) [17:57:54] (03PS5) 10Daniel Kinzler: rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [17:58:19] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1019 [17:58:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1019 [17:59:12] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS trixie [18:00:05] brennen and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T1800). 
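Note on the conftool selectors logged at 17:27:16 and 17:55:11: both `name=` values read as regular expressions over host names. A rough sketch of which hosts each would cover, assuming the pattern must match the whole name (an assumption, not confirmed in this log) and using made-up FQDNs under `codfw.wmnet` purely for illustration:

```python
import re

# Selector values copied from the log entries at 17:27:16 and 17:55:11.
SELECTORS = {
    "first batch": r"cp203[0-5].*",
    "second batch": r"cp20(3[6-9]|4[012]).*",
}


def matching_hosts(pattern: str, hosts: list[str]) -> list[str]:
    """Return the hosts whose full name matches the pattern.

    Full-match anchoring is an assumption about how the selector is applied.
    """
    compiled = re.compile(pattern)
    return [host for host in hosts if compiled.fullmatch(host)]


# Hypothetical host names; the real inventory is not shown in the log.
HOSTS = [f"cp{number}.codfw.wmnet" for number in range(2027, 2046)]

if __name__ == "__main__":
    for label, pattern in SELECTORS.items():
        print(label, matching_hosts(pattern, HOSTS))
```

Under these assumptions the two selectors cover disjoint ranges, which fits depooling the hosts in two separate batches.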
[18:01:10] o/ [18:02:07] !log reprepro include php-igbinary_3.2.16-4+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:10] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [18:02:29] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:02:57] (03PS4) 10Jsn.sherman: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) [18:03:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:03:54] (03CR) 10CI reject: [V:04-1] PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) (owner: 10Jsn.sherman) [18:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:06:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251139 (https://phabricator.wikimedia.org/T419830) (owner: 10C. 
Scott Ananian) [18:07:56] (03Merged) 10jenkins-bot: Ensure that we always run ParserHooks::transformHtml() when using Parsoid [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251139 (https://phabricator.wikimedia.org/T419830) (owner: 10C. Scott Ananian) [18:08:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:28] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1251139|Ensure that we always run ParserHooks::transformHtml() when using Parsoid (T419830)]] [18:08:31] T419830: Signatures aren't properly recognized on ruwiktionary and other projects when using Parsoid, resulting in a missing [ reply ] button - https://phabricator.wikimedia.org/T419830 [18:08:39] FIRING: KubernetesCalicoDown: dse-k8s-worker1019.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1019.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:10:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:10:36] !log brennen@deploy2002 cscott, brennen: Backport for [[gerrit:1251139|Ensure that we always run ParserHooks::transformHtml() when using Parsoid (T419830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:16:01] cscott: hmm, not clear whether this fix actually worked. looking at https://ru.wiktionary.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C:%D0%A2%D0%B5%D1%85%D0%BD%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B5_%D0%B2%D0%BE%D0%BF%D1%80%D0%BE%D1%81%D1%8B in mwdebug... [18:16:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4045.ulsfo.wmnet with OS trixie [18:18:07] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [18:19:16] !log brennen@deploy2002 cscott, brennen: Continuing with sync [18:19:30] i'm going ahead with the sync since nothing seems to break. [18:20:11] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4038.ulsfo.wmnet with OS trixie [18:20:29] ...ah, helps if i enable javascript. ok, fix seems to work. 
[18:21:16] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:21:26] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 62708768 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:22:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2947464 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:23:14] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251139|Ensure that we always run ParserHooks::transformHtml() when using Parsoid (T419830)]] (duration: 14m 46s) [18:23:18] T419830: Signatures aren't properly recognized on ruwiktionary and other projects when using Parsoid, resulting in a missing [ reply ] button - https://phabricator.wikimedia.org/T419830 [18:24:52] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [18:24:56] brennen sorry i'm late i can test. you might have needed to `action=purge` to test properly too. [18:25:00] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating dse-k8s-worker1019 - btullis@cumin1003" [18:25:41] cscott: yeah, i did that at one point as well so possible it was relevant but i think the main thing i was forgetting was that i hadn't exempted wikitionary domains from noscript. [18:25:49] brennen: in any case https://ru.wiktionary.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C:%D0%A2%D0%B5%D1%85%D0%BD%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B5_%D0%B2%D0%BE%D0%BF%D1%80%D0%BE%D1%81%D1%8B looks like it is working now after purge [18:25:54] ::nod:: [18:26:11] going ahead with train. 
[18:26:26] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2332.codfw.wmnet [18:26:28] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2332.codfw.wmnet [18:26:33] thanks for the assist [18:26:35] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: Possible hardware issues on wikikube-worker2332.codfw.wmnet - https://phabricator.wikimedia.org/T419747#11703704 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by swfrench@cumin2002 pool for host wikikube-worker2332.codfw.wmnet com... [18:27:01] (03PS1) 10Gergő Tisza: login: Add 'alwaysShowLogin' login URL parameter [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251150 (https://phabricator.wikimedia.org/T419723) [18:27:30] brennen: FYI, I've repooled wikikube-worker2332 that caused trouble yesterday, as we believe we understand what caused the disruption. I'll be keeping an eye out for issues coming back, but wanted to flag with you just in case you see anything in logs. [18:27:32] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251151 (https://phabricator.wikimedia.org/T413810) [18:27:35] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251151 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [18:27:39] (the example on en.wikivoyage.org now works after purge as well, and i confirmed that the pages which *used to work* also *still work*. 
:) ) [18:27:46] swfrench-wmf: thanks for the heads up [18:27:56] (03PS1) 10Gergő Tisza: Use 'alwaysShowLogin' query parameter during login [extensions/CentralAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251152 (https://phabricator.wikimedia.org/T419723) [18:28:05] btullis@cumin1003 netbox (PID 3068500) is awaiting input [18:28:31] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251151 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [18:28:34] (03PS1) 10Aaron Schulz: Add WikiLambda extension REST module to the REST sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251154 (https://phabricator.wikimedia.org/T419053) [18:28:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/CentralAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251152 (https://phabricator.wikimedia.org/T419723) (owner: 10Gergő Tisza) [18:28:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating dse-k8s-worker1019 - btullis@cumin1003" [18:28:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:40] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [18:30:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251150 (https://phabricator.wikimedia.org/T419723) (owner: 10Gergő Tisza) [18:32:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251106 (https://phabricator.wikimedia.org/T419107) (owner: 10Daimona Eaytoy) [18:34:13] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.19 refs T413810 [18:34:17] T413810: 1.46.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T413810 [18:36:22] (03PS5) 10Jsn.sherman: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) [18:38:35] (03PS6) 10Jsn.sherman: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) [18:40:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251138 (https://phabricator.wikimedia.org/T419883) (owner: 10SomeRandomDeveloper) [18:42:56] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp20(2[789]|3[0-9]|40).*,service=ats-be [18:44:16] (03Abandoned) 10Dzahn: add zuul-legacy to point at old zuul [dns] - 10https://gerrit.wikimedia.org/r/1250756 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:44:49] (03Abandoned) 10Dzahn: trafficserver/contint: add zuul-legacy site [puppet] - 10https://gerrit.wikimedia.org/r/1250757 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:45:30] (03CR) 10Dzahn: [C:03+2] "going ahead - no effect on non-trixie servers" [puppet] - 10https://gerrit.wikimedia.org/r/1250755 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:45:56] (03Abandoned) 10Aaron Schulz: Add WikiLambda extension REST module to the REST sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251154 (https://phabricator.wikimedia.org/T419053) (owner: 10Aaron Schulz) [18:46:24] (03PS2) 10Dzahn: 
profile::ci: add support for trixie / PHP8.4 [puppet] - 10https://gerrit.wikimedia.org/r/1250755 (https://phabricator.wikimedia.org/T418521) [18:46:53] brennen: once the dust settles, let me know if it would be alright to make tweaks to some of the canary deployments (reverting some experimental changes to envoy, which are largely transparent to mediawiki). [18:47:50] swfrench-wmf: will do. just going to test this backport once it lands and try to figure out if a couple of other errors are anything new, then should be clear. [18:48:02] awesome, thanks! [18:48:19] (03CR) 10RLazarus: [C:03+2] mathoid: Upgrade to envoy-future:1.35.9 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250728 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [18:48:34] (03CR) 10Dzahn: [C:03+2] profile::ci: add support for trixie / PHP8.4 [puppet] - 10https://gerrit.wikimedia.org/r/1250755 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:49:32] (03PS2) 10Dzahn: jenkins: add ci::httpd profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) [18:49:37] (03CR) 10Dzahn: jenkins: add ci::httpd profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:50:21] !incidents [18:50:21] 7745 (RESOLVED) This is a test incident (please ignore) [18:50:24] (03Merged) 10jenkins-bot: mathoid: Upgrade to envoy-future:1.35.9 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250728 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [18:52:02] I'm futzing around with envoy too but only in staging for now :) will sequence the prod changes in at a quiet moment [18:52:14] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [18:52:41] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [18:53:26] * brennen twiddles thumbs, looks at zuul, twiddles 
thumbs [18:54:53] (03Merged) 10jenkins-bot: EditPage: Re-add catch block for MWException [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251138 (https://phabricator.wikimedia.org/T419883) (owner: 10SomeRandomDeveloper) [18:55:14] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1251138|EditPage: Re-add catch block for MWException (T419883)]] [18:55:18] T419883: MediaWiki\Exception\MWException: Format text/plain is not supported for content model wikitext - https://phabricator.wikimedia.org/T419883 [18:55:32] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4039.ulsfo.wmnet with OS trixie [18:57:09] !log brennen@deploy2002 somerandomdeveloper, brennen: Backport for [[gerrit:1251138|EditPage: Re-add catch block for MWException (T419883)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:58:04] (03PS1) 10Jdlrobson: Support duplication of languages in header and main menu [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251157 (https://phabricator.wikimedia.org/T419730) [18:59:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [19:01:05] !log brennen@deploy2002 somerandomdeveloper, brennen: Continuing with sync [19:01:17] tested fix, looks good, going ahead. 
[19:03:28] !log reprepro include php-excimer_1.2.5-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:32] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [19:03:45] !log reprepro include php-imagick_3.7.0-13+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:59] !log reprepro include php-luasandbox_4.1.2-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:18] !log reprepro include php-memcached_3.3.0-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:30] !log reprepro include php-pcov_1.0.12-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:46] !log reprepro include php-redis_6.2.0-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:00] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251138|EditPage: Re-add catch block for MWException (T419883)]] (duration: 09m 46s) [19:05:01] !log reprepro include php-uuid_1.3.0-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:05:04] T419883: MediaWiki\Exception\MWException: Format text/plain is not supported for content model wikitext - https://phabricator.wikimedia.org/T419883 [19:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:18] !log reprepro include php-wmerrors_2.0.0-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:31] 
!log reprepro include php-xhprof_2.3.10-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:42] !log reprepro include php-yaml_2.2.4-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:54] !log reprepro include wikidiff2_1.14.1-2+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:07] !log reprepro include xdebug_3.4.4-1+wmf11u1+icu72u1 into component/php83-icu72 - T419058 [19:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:10] swfrench-wmf: over to you. [19:06:20] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4039.ulsfo.wmnet [reason: trixie reimaging] [19:06:25] brennen: perfect timing :) thank you! [19:06:38] (03CR) 10Scott French: [C:03+2] Revert "mw-(api-int|web): Pilot drain configuration in canary" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251131 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:06:38] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4041.ulsfo.wmnet [reason: trixie reimaging] [19:07:12] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS trixie [19:08:47] (03Merged) 10jenkins-bot: Revert "mw-(api-int|web): Pilot drain configuration in canary" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251131 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:11:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [19:12:19] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [19:12:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [19:13:29] !log swfrench@deploy2002 helmfile 
[codfw] DONE helmfile.d/services/mw-web: apply [19:14:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:15:17] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [19:16:29] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:16:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:19:32] deploying mathoid, just an envoy version bump [19:19:45] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [19:20:17] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [19:21:19] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [19:21:39] (03CR) 10Kosta Harlan: "@vgutierrez@wikimedia.org if you want to pair on deploying this, please let me know" [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan) [19:21:46] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [19:25:45] done, monitoring [19:26:10] forgot to say, all done on my end (coordinating externally with r.zl) [19:34:58] (03PS1) 10RLazarus: envoy: Update to v1.35.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1251162 (https://phabricator.wikimedia.org/T419637) [19:37:25] (03CR) 10RLazarus: [V:04-1] "This isn't ready just because I haven't copied the debian package yet. (If you build it locally you'll still get 1.35.7.) 
I'll do that and" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1251162 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [19:44:17] (03PS2) 10NMW03: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 [19:44:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 (owner: 10NMW03) [19:45:47] (03CR) 10CI reject: [V:04-1] Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 (owner: 10NMW03) [19:49:01] (03PS3) 10NMW03: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 [19:50:01] (03CR) 10CI reject: [V:04-1] Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 (owner: 10NMW03) [19:51:20] (03PS4) 10NMW03: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 [19:51:48] jouncebot: next [19:51:48] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T2000) [19:53:47] (03PS5) 10NMW03: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 (https://phabricator.wikimedia.org/T419899) [19:56:37] (03PS1) 10Herron: site: prep o11ytest [puppet] - 10https://gerrit.wikimedia.org/r/1251167 (https://phabricator.wikimedia.org/T419902) [19:58:42] FIRING: [15x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts 
another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T2000). [20:00:05] JSherman, tgr, danisztls, cscott, _Gerges, and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:12] o/ [20:00:14] o/ [20:00:28] <_Gerges_> here [20:00:32] o/ [20:01:38] happy to deploy for anyone who needs a deployer - otherwise maybe self-deployers can self-organize per queue? [20:01:44] (03CR) 10Scott French: [C:03+1] "Thanks for flagging. Yup, should do the right thing once the package is included." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1251162 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [20:01:59] o/ [20:02:01] I can self-deploy if it helps [20:02:06] sounds fine to me [20:02:09] my patch is for increasing IP cap limit, it requires an extra command because the event is in less than 72 hours. Sorry about that, I didn't have time to send a patch earlier [20:02:32] <_Gerges_> @cjming, me [20:02:47] JSherman: are you self-deploying? [20:02:49] Mine is flipping a few config flags; happy to consolidate any other low risk/no-i18n rebuild patches [20:02:53] yep [20:03:14] question about wmgThrottlingExceptions: which timezone should I use for "from" and "to"? Is it UTC? Thats what I used there [20:03:34] anybody want to be deployed with me before I throw the switch? [20:03:41] RESOLVED: [22x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:03:56] the IP cap change I suppose? [20:04:08] and maybe _Gerges's patch? 
[20:04:10] sure [20:04:15] JSherman: mine maybe [20:04:27] mmk, lemme pull these together [20:05:08] i can run the namespaces dedupe script afterwards [20:05:19] Nemoralis: I don't think you need any extra command for a throttle that's in two days? [20:05:43] if it's needed - maybe it's not [20:05:55] tgr_: isn't 72 hours 3 days? Thats what the documentation on Wikitech says [20:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) (owner: 10Jsn.sherman) [20:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251140 (https://phabricator.wikimedia.org/T419828) (owner: 10GergesShamon) [20:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251098 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:05:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 (https://phabricator.wikimedia.org/T419899) (owner: 10NMW03) [20:06:07] here we go [20:06:11] nice [20:06:12] https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [20:06:59] (03Merged) 10jenkins-bot: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) (owner: 10Jsn.sherman) [20:07:03] (03Merged) 10jenkins-bot: [arwikiquote] add namespace alias for NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251140 (https://phabricator.wikimedia.org/T419828) (owner: 10GergesShamon) [20:07:11] (03Merged) 10jenkins-bot: Deploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251098 
(https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:07:15] (03Merged) 10jenkins-bot: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251164 (https://phabricator.wikimedia.org/T419899) (owner: 10NMW03) [20:07:35] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1249364|PersonalDashboard: enable CTA for pilot wikis (T418613)]], [[gerrit:1251140|[arwikiquote] add namespace alias for NS_PROJECT (T419828)]], [[gerrit:1251098|Deploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1251164|Increase IP cap limit for azwiki (T419899)]] [20:07:39] (mine could have been thrown in, too, but i'm fine w/ waiting for my turn) [20:07:46] T418613: Set live configurations for Extension:PersonalDashboard on pilot wikis - https://phabricator.wikimedia.org/T418613 [20:07:46] T419828: Namespace alias on ar.wikiquote - https://phabricator.wikimedia.org/T419828 [20:07:46] T419778: Deploy QuickSurvey for research participant registration drive on frwiki - https://phabricator.wikimedia.org/T419778 [20:07:47] T419899: Requesting temporary lift of IP cap for 14 March 2026 for azwiki - https://phabricator.wikimedia.org/T419899 [20:08:04] the account creation throttle has 24 hours expiry [20:08:25] so the documentation is out of date? [20:08:26] not sure where the 72 hour limit comes from [20:09:24] !log jsn@deploy2002 jsn, dani, nmw03, gergesshamon: Backport for [[gerrit:1249364|PersonalDashboard: enable CTA for pilot wikis (T418613)]], [[gerrit:1251140|[arwikiquote] add namespace alias for NS_PROJECT (T419828)]], [[gerrit:1251098|Deploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1251164|Increase IP cap limit for azwiki (T419899)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:27] I don't think it was ever correct
[20:09:34] maybe I'm missing something
[20:09:41] Nemoralis: _Gerges_: danisztls: please test as able
[20:09:52] it was imported from somewhere else: https://wikitech.wikimedia.org/?diff=47172
[20:10:10] JSherman: I am not sure how to test my patch
[20:10:18] you can't
[20:10:28] not outside the event date range
[20:10:30] JSherman: looks good
[20:10:49] <_Gerges_> Everything I have is fine.
[20:11:11] (CR) Andrea Denisse: [C:+1] "LGTM, thanks!!" [puppet] - https://gerrit.wikimedia.org/r/1251167 (https://phabricator.wikimedia.org/T419902) (owner: Herron)
[20:11:23] tgr_: do you mind updating the documentation then? Thanks for the information!
[20:13:40] (CR) Herron: [C:+2] site: prep o11ytest [puppet] - https://gerrit.wikimedia.org/r/1251167 (https://phabricator.wikimedia.org/T419902) (owner: Herron)
[20:14:44] okay, mine is actually a problem here, but it's really minor. I'm going to roll forward and add small followon
[20:14:52] !log jsn@deploy2002 jsn, dani, nmw03, gergesshamon: Continuing with sync
[20:15:42] TBH I'm not sure why you would need to clear the throttle even if you are changing it right before or during the event
[20:17:05] it stores the number of actual registrations, not the limit
[20:17:12] I'll just remove that
[20:18:46] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249364|PersonalDashboard: enable CTA for pilot wikis (T418613)]], [[gerrit:1251140|[arwikiquote] add namespace alias for NS_PROJECT (T419828)]], [[gerrit:1251098|Deploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1251164|Increase IP cap limit for azwiki (T419899)]] (duration: 11m 11s)
[20:18:55] T418613: Set live configurations for Extension:PersonalDashboard on pilot wikis - https://phabricator.wikimedia.org/T418613
[20:18:55] T419828: Namespace alias on ar.wikiquote - https://phabricator.wikimedia.org/T419828
[20:18:55] T419778: Deploy QuickSurvey for research participant registration drive on frwiki - https://phabricator.wikimedia.org/T419778
[20:18:56] T419899: Requesting temporary lift of IP cap for 14 March 2026 for azwiki - https://phabricator.wikimedia.org/T419899
[20:19:03] okey dokey
[20:19:41] tgr_: you're up
[20:20:04] <_Gerges_> Thanks
[20:20:35] thx
[20:20:42] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm
[20:21:27] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1251087 (https://phabricator.wikimedia.org/T417278) (owner: Gergő Tisza)
[20:21:27] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251088 (https://phabricator.wikimedia.org/T417278) (owner: Gergő Tisza)
[20:21:27] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251106 (https://phabricator.wikimedia.org/T419107) (owner: Daimona Eaytoy)
[20:21:58] JSherman: thanks!
[20:23:04] (PS1) Jsn.sherman: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1251168 (https://phabricator.wikimedia.org/T418613)
[20:24:09] okay, I've tagged on a little followup I would like to squeak in if possible: 1251168
[20:24:36] JSherman: thanks!
[20:24:43] (Merged) jenkins-bot: Set 'sub' JWT field in client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1251087 (https://phabricator.wikimedia.org/T417278) (owner: Gergő Tisza)
[20:24:53] (Merged) jenkins-bot: Set 'sub' JWT field in client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251088 (https://phabricator.wikimedia.org/T417278) (owner: Gergő Tisza)
[20:25:01] tgr_: i definitely had issues if i hadn't cleared the throttle when doing short-notice changes to throttle.php
[20:25:34] I don't see how that's supposed to happen
[20:25:43] worst case, you start from 6 rather than 0
[20:26:09] but that can happen anyway if you set the configuration a week ago but then an hour before the event someone maxes out the throttle
[20:26:37] (CR) Scott French: [V:+2] "Built:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: Scott French)
[20:28:13] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4041.ulsfo.wmnet with OS trixie
[20:29:30] tgr_: urbanecm: is there any downside to running the sync for the patch from Nemoralis: ?
[20:29:45] JSherman: can you link me the patch?
[20:29:56] urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1251164
[20:30:07] JSherman: go ahead
[20:30:24] no downside for sure
[20:30:32] it just seems like a cargo cult thing
[20:30:46] (cargo cult?)
[20:30:58] (doing the thing because the thing has been done)
[20:31:13] https://en.wikipedia.org/wiki/Cargo_cult_programming
[20:32:17] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4045.*
[20:32:45] (Abandoned) Kgraessle: PersonalDashboard edit count configurations should have an upper bound and limit personal tools menu access too [mediawiki-config] - https://gerrit.wikimedia.org/r/1251023 (https://phabricator.wikimedia.org/T418365) (owner: Kgraessle)
[20:33:21] okay running sync-file wmf-config/throttle.php
[20:34:25] (that shouldn't be needed, provided you already run scap backport (patch id))
[20:35:06] (Merged) jenkins-bot: phpunit: Avoid unnecessary writes in generatePHPUnitConfig.php [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251106 (https://phabricator.wikimedia.org/T419107) (owner: Daimona Eaytoy)
[20:35:20] ah, that's what Nemoralis: was asking about
[20:35:27] (PS1) WMDE-leszek: Add output-dir option to specify target directory for JSON dumps [dumps] - https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296)
[20:35:41] !log jsn@deploy2002 Synchronized wmf-config/throttle.php: (no justification provided) (duration: 01m 57s)
[20:35:55] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1251087|Set 'sub' JWT field in client credentials access tokens (T417278)]], [[gerrit:1251088|Set 'sub' JWT field in client credentials access tokens (T417278)]], [[gerrit:1251106|phpunit: Avoid unnecessary writes in generatePHPUnitConfig.php (T419107)]]
[20:36:00] T417278: Choosing client credentials grant for OAuth 2 results in an anonymous access token - https://phabricator.wikimedia.org/T417278
[20:36:00] T419107: The PHPUnit config override does not appear to be auto-generated - https://phabricator.wikimedia.org/T419107
[20:36:40] (CR) Scott French: [V:+2] "Thanks, Moritz!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: Scott French)
[20:37:15] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS trixie
[20:37:42] !log tgr@deploy2002 tgr, daimona: Backport for [[gerrit:1251087|Set 'sub' JWT field in client credentials access tokens (T417278)]], [[gerrit:1251088|Set 'sub' JWT field in client credentials access tokens (T417278)]], [[gerrit:1251106|phpunit: Avoid unnecessary writes in generatePHPUnitConfig.php (T419107)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:38:32] cscott: any chance I can tag a little followup config change along with yours when it's time?
[20:39:17] (CR) Kgraessle: [C:+1] PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1251168 (https://phabricator.wikimedia.org/T418613) (owner: Jsn.sherman)
[20:39:40] !log tgr@deploy2002 tgr, daimona: Continuing with sync
[20:39:52] I can add some config patches in the next round
[20:40:08] (CR) Gergő Tisza: [C:+2] Use 'alwaysShowLogin' query parameter during login [extensions/CentralAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251152 (https://phabricator.wikimedia.org/T419723) (owner: Gergő Tisza)
[20:40:11] (CR) Gergő Tisza: [C:+2] login: Add 'alwaysShowLogin' login URL parameter [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251150 (https://phabricator.wikimedia.org/T419723) (owner: Gergő Tisza)
[20:40:24] tgr_: thanks! https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1251168
[20:41:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251168 (https://phabricator.wikimedia.org/T418613) (owner: Jsn.sherman)
[20:43:10] (PS2) Scott French: php8.3-icu72: Create new ICU 72 flavored image stack [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058)
[20:43:32] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251087|Set 'sub' JWT field in client credentials access tokens (T417278)]], [[gerrit:1251088|Set 'sub' JWT field in client credentials access tokens (T417278)]], [[gerrit:1251106|phpunit: Avoid unnecessary writes in generatePHPUnitConfig.php (T419107)]] (duration: 07m 37s)
[20:43:37] T417278: Choosing client credentials grant for OAuth 2 results in an anonymous access token - https://phabricator.wikimedia.org/T417278
[20:43:37] T419107: The PHPUnit config override does not appear to be auto-generated - https://phabricator.wikimedia.org/T419107
[20:43:39] FIRING: KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1018.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:44:25] (CR) Scott French: "Squashed into what was I4800c286d4a29ded5a7b77c26e6edfb359694b83." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: Scott French)
[20:44:50] JSherman: sure?
[20:45:02] (CR) Scott French: [V:+2] "https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1249981/comments/92fb99e9_10a2ebe0" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: Scott French)
[20:45:27] I'll add cscott's patch as well
[20:46:06] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251152 (https://phabricator.wikimedia.org/T419723) (owner: Gergő Tisza)
[20:46:06] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251150 (https://phabricator.wikimedia.org/T419723) (owner: Gergő Tisza)
[20:46:06] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251168 (https://phabricator.wikimedia.org/T418613) (owner: Jsn.sherman)
[20:46:07] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1250750 (https://phabricator.wikimedia.org/T414852) (owner: C. Scott Ananian)
[20:46:43] tgr_: thanks!
[20:47:00] (Merged) jenkins-bot: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1251168 (https://phabricator.wikimedia.org/T418613) (owner: Jsn.sherman)
[20:47:03] (Merged) jenkins-bot: Enable parser survey for opted-out users on ru/pt/ja/id wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1250750 (https://phabricator.wikimedia.org/T414852) (owner: C. Scott Ananian)
[20:49:33] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS trixie
[20:49:55] (CR) Scott French: [V:+2 C:+2] php8.3-icu72: Create new ICU 72 flavored image stack [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: Scott French)
[20:51:38] (Merged) jenkins-bot: login: Add 'alwaysShowLogin' login URL parameter [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251150 (https://phabricator.wikimedia.org/T419723) (owner: Gergő Tisza)
[20:51:42] (Merged) jenkins-bot: Use 'alwaysShowLogin' query parameter during login [extensions/CentralAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251152 (https://phabricator.wikimedia.org/T419723) (owner: Gergő Tisza)
[20:52:03] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1251152|Use 'alwaysShowLogin' query parameter during login (T419723)]], [[gerrit:1251150|login: Add 'alwaysShowLogin' login URL parameter (T419723)]], [[gerrit:1251168|PersonalDashboard: enable CTA for pilot wikis (T418613)]], [[gerrit:1250750|Enable parser survey for opted-out users on ru/pt/ja/id wikis (T414852)]]
[20:52:08] ok, i have another backport once tgr_ is done, to fix an UBN in production.
[20:52:12] T419723: New accounts created from editor anon warning redirect to welcome survey, not back to editor (2026) - https://phabricator.wikimedia.org/T419723
[20:52:12] T418613: Set live configurations for Extension:PersonalDashboard on pilot wikis - https://phabricator.wikimedia.org/T418613
[20:52:12] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852
[20:52:29] (PS1) C. Scott Ananian: Revert "Move post-processing of flaggedrevs views inside FlaggablePageView" [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251173
[20:52:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251173 (owner: C. Scott Ananian)
[20:52:41] (PS1) Herron: preseed: add o11ytest [puppet] - https://gerrit.wikimedia.org/r/1251174 (https://phabricator.wikimedia.org/T419902)
[20:53:58] !log tgr@deploy2002 tgr, jsn, cscott: Backport for [[gerrit:1251152|Use 'alwaysShowLogin' query parameter during login (T419723)]], [[gerrit:1251150|login: Add 'alwaysShowLogin' login URL parameter (T419723)]], [[gerrit:1251168|PersonalDashboard: enable CTA for pilot wikis (T418613)]], [[gerrit:1250750|Enable parser survey for opted-out users on ru/pt/ja/id wikis (T414852)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:54:20] (Abandoned) Scott French: php8.3-icu72: Clone php8.3 image stack [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249980 (https://phabricator.wikimedia.org/T419058) (owner: Scott French)
[20:54:41] tgr_: ok, i'll verify that the survey is live
[20:54:52] urbanecm: ready for testing
[20:55:10] tgr_: the returnto thing, right?
[20:55:20] yes
[20:55:24] testing
[20:56:17] tgr_: works like a charm
[20:56:23] looking good
[20:57:14] helps if i turn on x-wikimedia-debug when i'm testing
[20:57:23] but yes, looks good
[20:58:06] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage
[20:58:34] !log tgr@deploy2002 tgr, jsn, cscott: Continuing with sync
[20:59:17] (CR) Herron: [C:+2] preseed: add o11ytest [puppet] - https://gerrit.wikimedia.org/r/1251174 (https://phabricator.wikimedia.org/T419902) (owner: Herron)
[21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T2100)
[21:02:01] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11704396 (bd808) >>! In T353891#11701203, @ABran-WMF wrote: > with {T286066} done it should be better: > > Please l...
[21:02:19] before webteam deploys, i need to deploy an UBN for T414359
[21:02:20] T414359: Use postprocessing cache in FlaggedRevisions - https://phabricator.wikimedia.org/T414359
[21:02:45] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251152|Use 'alwaysShowLogin' query parameter during login (T419723)]], [[gerrit:1251150|login: Add 'alwaysShowLogin' login URL parameter (T419723)]], [[gerrit:1251168|PersonalDashboard: enable CTA for pilot wikis (T418613)]], [[gerrit:1250750|Enable parser survey for opted-out users on ru/pt/ja/id wikis (T414852)]] (duration: 10m 41s)
[21:02:52] T419723: New accounts created from editor anon warning redirect to welcome survey, not back to editor (2026) - https://phabricator.wikimedia.org/T419723
[21:02:52] T418613: Set live configurations for Extension:PersonalDashboard on pilot wikis - https://phabricator.wikimedia.org/T418613
[21:02:53] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852
[21:03:24] cscott: you good to self deploy?
[21:03:32] yes. tgr_ are you done?
[21:03:43] cscott: yes
[21:03:55] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251173 (owner: C. Scott Ananian)
[21:04:01] ok go go go
[21:04:36] the artists formerly known as web team don't always use this window anyhow, so this is one of the less disruptive slots to overrun into.
[21:04:39] i didn't closely follow exactly whose patches got rolled up in whose, but everyone got on the train?
[21:04:52] (CR) Anne Tomasevich: [C:-1] "-1 for now while we fix a couple of minor issues noted in the original patch" [skins/Vector] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251157 (https://phabricator.wikimedia.org/T419730) (owner: Jdlrobson)
[21:04:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage
[21:05:19] (Merged) jenkins-bot: Revert "Move post-processing of flaggedrevs views inside FlaggablePageView" [extensions/FlaggedRevs] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251173 (owner: C. Scott Ananian)
[21:05:29] looks like everybody got in
[21:05:37] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1251173|Revert "Move post-processing of flaggedrevs views inside FlaggablePageView"]]
[21:07:26] !log cscott@deploy2002 cscott: Backport for [[gerrit:1251173|Revert "Move post-processing of flaggedrevs views inside FlaggablePageView"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:07:34] (PS3) Elukey: profile::pyrra: remove old wdqs SLO configs [puppet] - https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966)
[21:09:08] !log cscott@deploy2002 cscott: Continuing with sync
[21:09:22] yep, test looks good. i have successfully UBN'ed the UBN.
[21:09:26] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[21:13:06] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251173|Revert "Move post-processing of flaggedrevs views inside FlaggablePageView"]] (duration: 07m 28s)
[21:13:12] where's my t-shirt. :)
[21:13:42] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[21:13:44] anyway, i'm done. if we skipped anyone in the backport window or if the web team has anything they need deployed, the window is open.
[21:15:59] (PS8) Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453)
[21:15:59] (PS10) Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453)
[21:20:01] (CR) Ryan Kemper: [C:+2] wdqs: Separate deadlock remediation config [puppet] - https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[21:20:06] (CR) Ryan Kemper: [C:+2] wdqs: Per-instance deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[21:20:32] !log rzl@apt1002:~$ sudo -i reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.35.9-1_amd64.deb
[21:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:40] !log rzl@apt1002:~$ sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy
[21:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:47] !log rzl@apt1002:~$ sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy
[21:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:24] (CR) RLazarus: [V:+2 C:+2] "Packages copied, image built and checked locally." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1251162 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[21:23:28] cscott: 🎉 👕
[21:26:42] !log herron@cumin1003 START - Cookbook sre.hosts.rename from mwlog2002 to o11ytest2001
[21:27:05] !log herron@cumin1003 START - Cookbook sre.dns.netbox
[21:27:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:28:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4046.ulsfo.wmnet with OS trixie
[21:28:42] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-10-224034 to 2026-03-12-210521 [deployment-charts] - https://gerrit.wikimedia.org/r/1251181 (https://phabricator.wikimedia.org/T419788)
[21:29:19] (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2026-03-10-224034 to 2026-03-12-210521 [deployment-charts] - https://gerrit.wikimedia.org/r/1251181 (https://phabricator.wikimedia.org/T419788) (owner: Jforrester)
[21:31:26] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-10-224034 to 2026-03-12-210521 [deployment-charts] - https://gerrit.wikimedia.org/r/1251181 (https://phabricator.wikimedia.org/T419788) (owner: Jforrester)
[21:31:38] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mwlog2002 to o11ytest2001 - herron@cumin1003"
[21:32:09] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mwlog2002 to o11ytest2001 - herron@cumin1003"
[21:32:09] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:32:09] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache o11ytest2001 on all recursors
[21:32:13] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) o11ytest2001 on all recursors
[21:32:13] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host o11ytest2001
[21:32:40] (PS1) RLazarus: mw-*: Upgrade to Envoy 1.35.9 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1251182 (https://phabricator.wikimedia.org/T419637)
[21:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:33:46] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:34:06] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:34:56] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:35:10] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host o11ytest2001
[21:35:25] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:35:33] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:35:46] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mwlog2002 to o11ytest2001
[21:36:05] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:39:01] jasmine@cumin2002 reimage (PID 1944196) is awaiting input
[21:39:06] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host o11ytest2001.codfw.wmnet with OS trixie
[21:39:14] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie
[21:39:17] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host o11ytest2001
[21:39:28] SRE, ServiceOps new, ServiceOps-Upgrades-Hardware, Kubernetes, Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11704546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jasmine@cumin2002 for host wikikube-ctrl2006...
[21:39:37] !log herron@cumin1003 START - Cookbook sre.dns.netbox
[21:39:58] SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-06 - 2026-03-27): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11704547 (EBernhardson) Stalled→Invalid
[21:40:09] (PS1) RLazarus: {api,rest}-gateway: Update to Envoy 1.35.9 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1251185 (https://phabricator.wikimedia.org/T419637)
[21:40:33] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS trixie
[21:41:14] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet [reason: trixie reimaging]
[21:43:30] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS trixie
[21:45:11] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host o11ytest2001 - herron@cumin1003"
[21:45:16] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host o11ytest2001 - herron@cumin1003"
[21:45:16] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:45:16] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache o11ytest2001.codfw.wmnet 9.32.192.10.in-addr.arpa 9.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:45:20] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) o11ytest2001.codfw.wmnet 9.32.192.10.in-addr.arpa 9.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:45:20] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host o11ytest2001
[21:45:37] jasmine@cumin2002 reimage (PID 1944196) is awaiting input
[21:45:56] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host o11ytest2001
[21:45:56] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host o11ytest2001
[21:48:50] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11704581 (Scott_French) @MoritzMuehlenhoff - So, there were two motivations for targeting bookworm initially: 1. Concerns about incompatibility...
[21:49:20] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11704582 (Scott_French)
[21:51:38] Anyone object to us doing a quick backport soon as an extended pickup from the web team deployment window?
[21:52:41] We're still doing a thorough review though, may be another 45-50 mins
[21:52:55] (PS1) RLazarus: mw-*: Update to Envoy 1.35.9 [deployment-charts] - https://gerrit.wikimedia.org/r/1251187 (https://phabricator.wikimedia.org/T419637)
[21:53:38] (PS1) Ryan Kemper: wdqs: Plumb deadlock threshold/cooldown thru hiera [puppet] - https://gerrit.wikimedia.org/r/1251188 (https://phabricator.wikimedia.org/T242453)
[21:54:01] bvibber: I'll be bouncing some envoys around but not quite ready yet either, we can coordinate -- if you're ready before me, feel free
[21:54:17] (CR) Scott French: [C:+1] mw-*: Upgrade to Envoy 1.35.9 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1251182 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[21:55:05] (CR) Scott French: [C:+1] {api,rest}-gateway: Update to Envoy 1.35.9 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1251185 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[21:58:01] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4040.ulsfo.wmnet [reason: trixie reimaging]
[21:58:36] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS trixie
[21:59:44] (PS1) Bvibber: Enable ReaderExperiments Share Highlight subfeature for metrics [mediawiki-config] - https://gerrit.wikimedia.org/r/1251190 (https://phabricator.wikimedia.org/T416945)
[22:00:11] (PS2) Ryan Kemper: wdqs: Plumb deadlock threshold/cooldown thru hiera [puppet] - https://gerrit.wikimedia.org/r/1251188 (https://phabricator.wikimedia.org/T242453)
[22:00:19] rzl: awesome thx :D
[22:00:35] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1251188 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[22:00:50] (PS2) RLazarus: mw-*: Upgrade to Envoy 1.35.9 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1251182 (https://phabricator.wikimedia.org/T419637)
[22:00:50] (PS2) RLazarus: mw-*: Update to Envoy 1.35.9 [deployment-charts] - https://gerrit.wikimedia.org/r/1251187 (https://phabricator.wikimedia.org/T419637)
[22:01:30] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS trixie
[22:01:46] SRE, ServiceOps new, ServiceOps-Upgrades-Hardware, Kubernetes, Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11704599 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jasmine@cumin2002 for host wikikube-ctrl2006.cod...
[22:03:33] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on o11ytest2001.codfw.wmnet with reason: host reimage
[22:04:15] (CR) Scott French: [C:+1] mw-*: Upgrade to Envoy 1.35.9 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1251182 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus)
[22:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:05:45] (PS1) RLazarus: mw-parsoid: Delete values-canary.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1251192 (https://phabricator.wikimedia.org/T386246)
[22:07:08] bvibber: okay if I go ahead?
[22:08:23] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [22:09:30] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on o11ytest2001.codfw.wmnet with reason: host reimage [22:10:22] ETIMEDOUT, proceeding :) [22:10:31] (CR) RLazarus: [C:+2] mw-*: Upgrade to Envoy 1.35.9 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1251182 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus) [22:12:56] (Merged) jenkins-bot: mw-*: Upgrade to Envoy 1.35.9 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1251182 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus) [22:13:27] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [22:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:15:30] rzl: go for it :D [22:15:33] sorry was in middle of CR :D [22:17:39] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1251182 T419637 [22:17:43] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [22:20:14] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4046.* [22:22:37] this'll either time out or succeed slower than usual because I accidentally skipped a step, either way is okay [22:23:21] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4042.ulsfo.wmnet [reason: trixie reimaging] [22:23:52] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS trixie [22:23:55] SRE, ServiceOps new, ServiceOps-Upgrades-Hardware,
Kubernetes, Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11704620 (jasmine_) For visibility, I've intentionally aborted the reimage due to the following [0] which I will investigate &... [22:24:31] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1251182 T419637 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:24:35] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [22:24:40] first try 😅 [22:25:35] (CR) JHathaway: sre.hosts.provision: allow no-pxe settings for NIC on Supermicro (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: Elukey) [22:26:19] (PS1) Codename Noreste: idwiki: Remove unused user groups on Indonesian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) [22:26:41] !log rzl@deploy2002 rzl: Continuing with sync [22:27:57] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host o11ytest2001.codfw.wmnet with OS trixie [22:28:00] (PS1) Bvibber: Metrics module for share highlight experiment baseline [extensions/ReaderExperiments] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1251194 (https://phabricator.wikimedia.org/T416945) [22:28:10] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1251182 T419637 (duration: 11m 18s) [22:28:17] (PS1) Bvibber: Metrics module for share highlight experiment baseline [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251195 (https://phabricator.wikimedia.org/T416945) [22:28:37] rzl: all done or any more coming? [22:29:16] bvibber: go ahead! I'll touch api-gateway and rest-gateway next, but nothing that needs the scap lock [22:29:24] awesome sauce :D thanks!
[22:30:33] (Abandoned) Bvibber: Metrics module for share highlight experiment baseline [extensions/ReaderExperiments] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1251194 (https://phabricator.wikimedia.org/T416945) (owner: Bvibber) [22:30:55] starting... [22:31:00] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251190 (https://phabricator.wikimedia.org/T416945) (owner: Bvibber) [22:31:00] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251195 (https://phabricator.wikimedia.org/T416945) (owner: Bvibber) [22:31:56] (Merged) jenkins-bot: Enable ReaderExperiments Share Highlight subfeature for metrics [mediawiki-config] - https://gerrit.wikimedia.org/r/1251190 (https://phabricator.wikimedia.org/T416945) (owner: Bvibber) [22:32:09] (Merged) jenkins-bot: Metrics module for share highlight experiment baseline [extensions/ReaderExperiments] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1251195 (https://phabricator.wikimedia.org/T416945) (owner: Bvibber) [22:32:28] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1251190|Enable ReaderExperiments Share Highlight subfeature for metrics (T416945)]], [[gerrit:1251195|Metrics module for share highlight experiment baseline (T416945)]] [22:32:32] T416945: Measure current article text highlighting behavior - https://phabricator.wikimedia.org/T416945 [22:34:24] SRE, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11704668 (herron) >>! In T417163#11659366, @herron wrote: > * Detailed START/STATUS/DONE/PASS output. Along with a single "heads up" message to...
[22:34:25] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1251190|Enable ReaderExperiments Share Highlight subfeature for metrics (T416945)]], [[gerrit:1251195|Metrics module for share highlight experiment baseline (T416945)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:35:18] !log bvibber@deploy2002 bvibber: Continuing with sync [22:37:10] (CR) Scott French: [C:+1] mw-*: Update to Envoy 1.35.9 [deployment-charts] - https://gerrit.wikimedia.org/r/1251187 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus) [22:38:17] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4041.ulsfo.wmnet with OS trixie [22:39:17] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251190|Enable ReaderExperiments Share Highlight subfeature for metrics (T416945)]], [[gerrit:1251195|Metrics module for share highlight experiment baseline (T416945)]] (duration: 06m 49s) [22:39:20] T416945: Measure current article text highlighting behavior - https://phabricator.wikimedia.org/T416945 [22:39:21] whee [22:41:13] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet [reason: trixie reimaging] [22:42:12] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4043.ulsfo.wmnet [reason: trixie reimaging] [22:42:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS trixie [22:50:39] !log herron@cumin1003 START - Cookbook sre.hosts.rename from mwlog1002 to o11ytest1001 [22:50:56] "too many requests" errors on Phabricator, even though I'm logged in?
I'm just browsing around, not even doing very much… [22:51:01] !log herron@cumin1003 START - Cookbook sre.dns.netbox [22:54:39] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mwlog1002 to o11ytest1001 - herron@cumin1003" [22:55:02] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mwlog1002 to o11ytest1001 - herron@cumin1003" [22:55:02] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:55:03] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache o11ytest1001 on all recursors [22:55:06] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) o11ytest1001 on all recursors [22:55:07] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host o11ytest1001 [22:57:51] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host o11ytest1001 [22:58:28] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mwlog1002 to o11ytest1001 [22:59:52] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host o11ytest1001.eqiad.wmnet with OS trixie [22:59:59] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:00:04] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host o11ytest1001 [23:00:24] !log herron@cumin1003 START - Cookbook sre.dns.netbox [23:04:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS trixie [23:06:08] herron@cumin1003 reimage (PID 3099845) is awaiting input [23:06:46] (CR) Ryan Kemper: "self-merging; low blast radius and it fixes an issue where threshold/cooldown
isn't set to our preferred levels" [puppet] - https://gerrit.wikimedia.org/r/1251188 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper) [23:06:48] (CR) Ryan Kemper: [C:+2] wdqs: Plumb deadlock threshold/cooldown thru hiera [puppet] - https://gerrit.wikimedia.org/r/1251188 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper) [23:07:46] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11704748 (Papaul) Last update from Nokia today ` The following was added as a limitation under release notes: Management Release:25.10.2 Section:... [23:12:53] ops-codfw, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11704769 (Papaul) @Jhancock.wm try to delete the mgmt ip address and run the script again https://netbox.wikimedia.org/dcim/devices/3026/interfaces/ [23:17:26] (CR) JJMC89: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: Codename Noreste) [23:18:13] (CR) CI reject: [V:-1] idwiki: Remove unused user groups on Indonesian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: Codename Noreste) [23:18:42] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host o11ytest1001 - herron@cumin1003" [23:18:47] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host o11ytest1001 - herron@cumin1003" [23:18:47] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:18:47] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache
o11ytest1001.eqiad.wmnet 141.32.64.10.in-addr.arpa 1.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [23:18:51] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) o11ytest1001.eqiad.wmnet 141.32.64.10.in-addr.arpa 1.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [23:18:51] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host o11ytest1001 [23:19:36] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4040.ulsfo.wmnet with OS trixie [23:21:30] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS trixie [23:22:56] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host o11ytest1001 [23:22:56] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host o11ytest1001 [23:24:46] (CR) Codename Noreste: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: Codename Noreste) [23:30:08] (CR) RLazarus: [C:+2] {api,rest}-gateway: Update to Envoy 1.35.9 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1251185 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus) [23:32:12] cccccbukvgbcljrihjejnfdeividdrkkrrbhkgkgvnuk [23:32:22] (Merged) jenkins-bot: {api,rest}-gateway: Update to Envoy 1.35.9 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1251185 (https://phabricator.wikimedia.org/T419637) (owner: RLazarus) [23:35:40] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [23:35:54] (CR) Scott French: "In theory, it can be deployed any time before all RO-routed traffic shifts to eqiad during switchover Day 1 (Tuesday).
Doing so on Monday " [deployment-charts] - https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: Blake) [23:35:56] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [23:35:57] rolling out envoy updates to api-gateway and rest-gateway (the staging change is a no-op) [23:36:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [23:36:26] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on o11ytest1001.eqiad.wmnet with reason: host reimage [23:36:36] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [23:38:21] (PS2) Eevans: cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) [23:38:21] (PS2) Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) [23:38:21] (PS2) Eevans: services: add linked-artifacts service [deployment-charts] - https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [23:40:23] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on o11ytest1001.eqiad.wmnet with reason: host reimage [23:41:10] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [23:41:30] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [23:41:32] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [23:44:52] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4042.ulsfo.wmnet with OS trixie [23:45:13] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[23:45:18] correction, deploying rest-gateway in staging will also pick up https://gerrit.wikimedia.org/r/1251074 (since it went to prod only), going ahead with that [23:45:30] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [23:45:41] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [23:48:59] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [23:49:25] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [23:49:39] (PS3) Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) [23:49:39] (PS3) Eevans: services: add linked-artifacts service [deployment-charts] - https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [23:50:50] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS trixie [23:53:29] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [23:53:46] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [23:56:27] (PS4) Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) [23:56:27] (PS4) Eevans: services: add linked-artifacts service [deployment-charts] - https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [23:57:19] (CR) Eevans: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: Eevans) [23:57:32] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host o11ytest1001.eqiad.wmnet with OS trixie