[00:03:06] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:03:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:03:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint1003.wikimedia.org with OS trixie
[00:03:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2095.codfw.wmnet with reason: host reimage
[00:03:46] ops-eqiad, SRE, collaboration-services, Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11695586 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint100...
[00:08:37] ops-eqiad, SRE, collaboration-services, Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11695589 (VRiley-WMF) Was able to complete this after speaking with @Jhancock.wm Thank you! @Dzahn It should be c...
[00:21:16] jouncebot: nowandnext
[00:21:17] No deployments scheduled for the next 5 hour(s) and 38 minute(s)
[00:21:17] In 5 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T0600)
[00:21:22] (CR) Zabe: [C:+2] Stop setting $wgImageLinksSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/1250117 (https://phabricator.wikimedia.org/T299953) (owner: Zabe)
[00:22:24] (Merged) jenkins-bot: Stop setting $wgImageLinksSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/1250117 (https://phabricator.wikimedia.org/T299953) (owner: Zabe)
[00:23:16] zabe: Congratulations! Great work.
[00:23:32] Thanks:)
[00:24:18] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1250117|Stop setting $wgImageLinksSchemaMigrationStage (T299953)]]
[00:24:22] T299953: Normalize imagelinks table - https://phabricator.wikimedia.org/T299953
[00:26:27] !log zabe@deploy2002 zabe: Backport for [[gerrit:1250117|Stop setting $wgImageLinksSchemaMigrationStage (T299953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:29:59] !log zabe@deploy2002 zabe: Continuing with sync
[00:30:02] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1249460 (owner: TrainBranchBot)
[00:33:56] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250117|Stop setting $wgImageLinksSchemaMigrationStage (T299953)]] (duration: 09m 38s)
[00:34:00] T299953: Normalize imagelinks table - https://phabricator.wikimedia.org/T299953
[00:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:39:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250131
[00:39:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250131 (owner: TrainBranchBot)
[00:44:15] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:47:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:51:42] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250131 (owner: TrainBranchBot)
[00:56:12] ops-codfw, SRE,
SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11695622 (Jhancock.wm)
[00:59:04] ops-codfw, SRE, SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11695626 (Jhancock.wm) @MatthewVernon both these are having an issue at this step. puppet files might need an adjustment. [12/60, retrying in 360.00s] Attempt to run...
[01:08:10] (CR) Zabe: [C:+2] "retry" [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250131 (owner: TrainBranchBot)
[01:09:17] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1250136
[01:09:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1250136 (owner: TrainBranchBot)
[01:20:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1250131 (owner: TrainBranchBot)
[01:26:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1250136 (owner: TrainBranchBot)
[02:00:51] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:08:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 59s)
[02:08:55] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:15] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:33:55] RESOLVED:
[2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:48] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:30:17] PROBLEM - Ensure traffic_manager is running for instance backend on cp1100 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:31:17] RECOVERY - Ensure traffic_manager is running for instance backend on cp1100 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:08:43] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647 (Papaul) NEW
[04:09:30] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11695826 (Papaul) p:Triage→High a:cmooney→ayounsi
[04:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:44:15] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:54] (PS1) TChin: Add stream config for attribution research [mediawiki-config] - https://gerrit.wikimedia.org/r/1250249 (https://phabricator.wikimedia.org/T417050)
[05:10:26] (PS4) Ryan Kemper: Add new active-active discovery service for dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[05:10:34] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[05:14:19] (PS5) Ryan Kemper: Add new active-active discovery service for dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[05:14:27] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[05:26:33] (PS1) Clare Ming: Remove mpic redirects to test-kitchen [puppet] - https://gerrit.wikimedia.org/r/1250250 (https://phabricator.wikimedia.org/T415845)
[05:28:01] (PS6) Ryan Kemper: Add new active-active discovery service for dse-k8s [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[05:29:47] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[05:57:09] (CR) Ryan Kemper: "Checked this thoroughly, it's perfect.
Made a couple tiny amendments to commit message but otherwise this is ready to ship next Tuesday (o" [puppet] - https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: Bking)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T0600)
[06:08:06] (CR) Ayounsi: [C:+2] Add more depool strategies for rack depool cookbook [puppet] - https://gerrit.wikimedia.org/r/1249958 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[06:14:15] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:15:22] ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11695960 (ayounsi) a:RobH Rob, could you investigate those as well. Same as {T415743}. Please sync up with us to drain the link ahead of t...
[06:16:00] ops-magru, SRE, Infrastructure-Foundations, netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11695964 (ayounsi) Awesome, thx!!
[06:33:12] (PS1) Kevin Bazira: ml: add aiter support to vLLM 0.14 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1250416 (https://phabricator.wikimedia.org/T419650)
[06:59:48] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T0700)
[07:00:05] katherine_g and Msz2001: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:04:28] o/
[07:06:30] o/
[07:06:46] katherine_g: Do you need a deployer?
[07:07:11] Msz2001: I can go ahead- starting now
[07:07:22] ack
[07:07:50] (CR) TrainBranchBot: [C:+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle)
[07:08:53] (Merged) jenkins-bot: Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle)
[07:09:48] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1247639|Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis (T400727)]]
[07:09:50] (PS1) Mszwarc: Drop underscore from titles in wgOATH2FARequiredGroupRemovalPages [mediawiki-config] - https://gerrit.wikimedia.org/r/1250426
[07:09:52] T400727: set AutoModeratorMultiLingualRevertRisk with available wikis - https://phabricator.wikimedia.org/T400727
[07:12:01] !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1247639|Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis (T400727)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:18:15] !log kgraessle@deploy2002 kgraessle: Continuing with sync
[07:22:12] !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247639|Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis (T400727)]] (duration: 12m 24s)
[07:22:16] T400727: set AutoModeratorMultiLingualRevertRisk with available wikis - https://phabricator.wikimedia.org/T400727
[07:22:46] Msz2001: over to you
[07:23:50] (PS3) Mszwarc: Display list of 2FA-req. groups on AccountSecurity for 2FA-less users [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1249921 (https://phabricator.wikimedia.org/T419422)
[07:24:19] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1249921 (https://phabricator.wikimedia.org/T419422) (owner: Mszwarc)
[07:24:19] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1250066 (https://phabricator.wikimedia.org/T419111) (owner: Mszwarc)
[07:24:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1250426 (owner: Mszwarc)
[07:25:29] (PS3) AKhatun: stream: mediawiki.page_edit_type_simple [deployment-charts] - https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225)
[07:27:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir4003.ulsfo.wmnet
[07:27:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:28:00] (Merged) jenkins-bot: Display list of 2FA-req.
groups on AccountSecurity for 2FA-less users [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1249921 (https://phabricator.wikimedia.org/T419422) (owner: Mszwarc)
[07:28:52] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696068 (ayounsi) Using this opportunity to test my WIP rack depool cookbook (only in "show" mode). More info in {T327300} That's the current status of what...
[07:33:17] (PS1) Muehlenhoff: profile::server_depool: Annotate maps [puppet] - https://gerrit.wikimedia.org/r/1250431
[07:34:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir4003.ulsfo.wmnet - jmm@cumin2002"
[07:34:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir4003.ulsfo.wmnet - jmm@cumin2002"
[07:34:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:34:17] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir4003.ulsfo.wmnet on all recursors
[07:34:21] (PS2) Muehlenhoff: profile::server_depool: Annotate maps [puppet] - https://gerrit.wikimedia.org/r/1250431
[07:34:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir4003.ulsfo.wmnet on all recursors
[07:34:50] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir4003.ulsfo.wmnet - jmm@cumin2002"
[07:34:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir4003.ulsfo.wmnet - jmm@cumin2002"
[07:36:00] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12
point update - https://phabricator.wikimedia.org/T403852#11696079 (MoritzMuehlenhoff)
[07:36:13] (Merged) jenkins-bot: Send2FAWarningNotifications: Support reading users from file [extensions/WikimediaMessages] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1250066 (https://phabricator.wikimedia.org/T419111) (owner: Mszwarc)
[07:36:46] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1249921|Display list of 2FA-req. groups on AccountSecurity for 2FA-less users (T419422)]], [[gerrit:1250066|Send2FAWarningNotifications: Support reading users from file (T419111)]]
[07:36:51] T419422: Display a list of 2FA-requiring groups on Special:AccountSecurity if user has no 2FA configured - https://phabricator.wikimedia.org/T419422
[07:36:52] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111
[07:37:56] jmm@cumin2002 makevm (PID 1407536) is awaiting input
[07:38:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4003.ulsfo.wmnet with OS bookworm
[07:40:08] (PS2) AKhatun: topic: mw-page-edit-type-enrich-next [puppet] - https://gerrit.wikimedia.org/r/1249957 (https://phabricator.wikimedia.org/T351225)
[07:42:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[07:43:02] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696088 (ops-monitoring-bot) Draining ganeti1033.eqiad.wmnet of running VMs
[07:43:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet
[07:43:40] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696089 (MoritzMuehlenhoff)
[07:44:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for
draining ganeti node ganeti1049.eqiad.wmnet
[07:44:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1049.eqiad.wmnet
[07:46:41] (CR) Elukey: sre.hosts.provision: allow no-pxe settings for NIC on Supermicro (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: Elukey)
[07:47:39] (CR) Dpogorzelski: [C:+1] ml: add aiter support to vLLM 0.14 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1250416 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[07:47:43] (CR) Elukey: [C:+1] profile::server_depool: Annotate maps [puppet] - https://gerrit.wikimedia.org/r/1250431 (owner: Muehlenhoff)
[07:49:31] (CR) Muehlenhoff: [C:+2] profile::server_depool: Annotate maps [puppet] - https://gerrit.wikimedia.org/r/1250431 (owner: Muehlenhoff)
[07:50:56] (PS1) Arnaudb: mailman: update helo data to use lists1004.w.o [puppet] - https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066)
[07:51:04] (CR) Elukey: ml: add aiter support to vLLM 0.14 image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1250416 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[07:51:45] (CR) Elukey: [C:+2] installserver: update preseed config for ml-serve101[4,5] [puppet] - https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: Elukey)
[07:51:52] (CR) Dpogorzelski: [C:+2] ml: add aiter support to vLLM 0.14 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1250416 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[07:51:55] (CR) Dpogorzelski: [V:+2 C:+2] ml: add aiter support to vLLM 0.14 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1250416 (https://phabricator.wikimedia.org/T419650) (owner:
Kevin Bazira)
[07:52:40] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11696121 (ABran-WMF) good catch @taavi, thanks! I've sent [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250433 | a CR ]]...
[07:55:13] (CR) Kevin Bazira: ml: add aiter support to vLLM 0.14 image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1250416 (https://phabricator.wikimedia.org/T419650) (owner: Kevin Bazira)
[07:56:08] (CR) Elukey: sre.hosts.provision: add safeguard for typoes in serials (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1249971 (owner: Ayounsi)
[07:56:08] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1249921|Display list of 2FA-req. groups on AccountSecurity for 2FA-less users (T419422)]], [[gerrit:1250066|Send2FAWarningNotifications: Support reading users from file (T419111)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:56:13] T419422: Display a list of 2FA-requiring groups on Special:AccountSecurity if user has no 2FA configured - https://phabricator.wikimedia.org/T419422
[07:56:13] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111
[07:57:06] !log mszwarc@deploy2002 mszwarc: Continuing with sync
[07:58:13] For the record, there are "Cannot access the database: could not connect to any replica DB server" errors, but they don't seem related to this patch (and they have been appearing earlier as well)
[07:58:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4003.ulsfo.wmnet with reason: host reimage
[08:03:56] (PS1) Muehlenhoff: Add netflow4003 [puppet] - https://gerrit.wikimedia.org/r/1250499 (https://phabricator.wikimedia.org/T418993)
[08:04:17] (PS2) Muehlenhoff: Add netflow4003 [puppet] - https://gerrit.wikimedia.org/r/1250499 (https://phabricator.wikimedia.org/T418993)
[08:04:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4003.ulsfo.wmnet with reason: host reimage
[08:04:35] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1014.eqiad.wmnet with OS trixie
[08:05:01] !log installing mariadb bugfix updates from Bookworm point release (tools and libraries as packaged in Debian, unrelated to the wmf-mariadb packages)
[08:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:25] (CR) Jelto: "one comment in-line. Beside that this change looks reasonable."
[puppet] - https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[08:07:06] (CR) Mszwarc: [C:+2] "Ahead of deployment, to speed up things" [mediawiki-config] - https://gerrit.wikimedia.org/r/1250426 (owner: Mszwarc)
[08:07:58] (Merged) jenkins-bot: Drop underscore from titles in wgOATH2FARequiredGroupRemovalPages [mediawiki-config] - https://gerrit.wikimedia.org/r/1250426 (owner: Mszwarc)
[08:09:53] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249921|Display list of 2FA-req. groups on AccountSecurity for 2FA-less users (T419422)]], [[gerrit:1250066|Send2FAWarningNotifications: Support reading users from file (T419111)]] (duration: 33m 07s)
[08:09:58] T419422: Display a list of 2FA-requiring groups on Special:AccountSecurity if user has no 2FA configured - https://phabricator.wikimedia.org/T419422
[08:09:58] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111
[08:10:17] (CR) Muehlenhoff: installserver: update preseed config for ml-serve101[4,5] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: Elukey)
[08:10:35] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1250426|Drop underscore from titles in wgOATH2FARequiredGroupRemovalPages]]
[08:11:09] (CR) Brouberol: [C:+2] Bump Airflow image to include missing jars [deployment-charts] - https://gerrit.wikimedia.org/r/1249936 (https://phabricator.wikimedia.org/T415874) (owner: Aqu)
[08:14:47] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1250426|Drop underscore from titles in wgOATH2FARequiredGroupRemovalPages]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:15:16] !log mszwarc@deploy2002 mszwarc: Continuing with sync
[08:17:05] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1014.eqiad.wmnet with reason: host reimage
[08:19:16] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11696204 (MoritzMuehlenhoff)
[08:20:08] (PS2) Arnaudb: mailman: update helo data to use lists1004.w.o [puppet] - https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066)
[08:20:08] (CR) Arnaudb: "good catch! this will be handled by https://gerrit.wikimedia.org/r/c/operations/dns/+/1249310 which I intended to merge today" [puppet] - https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[08:21:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4003.ulsfo.wmnet with OS bookworm
[08:21:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir4003.ulsfo.wmnet
[08:21:16] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696205 (MatthewVernon)
[08:21:21] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250426|Drop underscore from titles in wgOATH2FARequiredGroupRemovalPages]] (duration: 10m 46s)
[08:21:38] (PS1) Muehlenhoff: Update netflow collector for ulsfo [homer/public] - https://gerrit.wikimedia.org/r/1250505 (https://phabricator.wikimedia.org/T418993)
[08:21:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir4004.ulsfo.wmnet
[08:21:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:21:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:21:56] !log UTC morning backport window finished
[08:21:58] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:58] (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[08:22:24] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696207 (MatthewVernon) Can I check this is 15:00 UTC (particularly given daylight confusion...), please? Once it's done I'll check ms-be1091 [the frontends c...
[08:22:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:23:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1014.eqiad.wmnet with reason: host reimage
[08:24:25] (PS1) Muehlenhoff: Add ncredir4003/ncredir4004 [puppet] - https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993)
[08:24:58] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11696209 (Ben.buchenau) Hi Andrea! Thanks for picking this up. A good name would be: **wmde_goal_monitoring**. Best, Ben
[08:25:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir4004.ulsfo.wmnet - jmm@cumin2002"
[08:26:28] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696225 (MoritzMuehlenhoff)
[08:28:26] jmm@cumin2002 makevm (PID 1420291) is awaiting input
[08:29:19] (CR) Muehlenhoff: [C:+1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/1249995 (owner: Majavah)
[08:30:10] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696229 (ayounsi) >>! In T419647#11696205, @MatthewVernon wrote: > Can I check this is 15:00 UTC (particularly given daylight confusion...), please? Once it's...
[08:30:52] (CR) Muehlenhoff: "If you want to merge, please go ahead! Otherwise I'll do it myself when time permits." [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[08:31:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir4004.ulsfo.wmnet - jmm@cumin2002"
[08:31:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:31:48] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir4004.ulsfo.wmnet on all recursors
[08:31:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir4004.ulsfo.wmnet on all recursors
[08:32:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir4004.ulsfo.wmnet - jmm@cumin2002"
[08:32:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir4004.ulsfo.wmnet - jmm@cumin2002"
[08:33:44] (CR) Jelto: [C:+1] "looks better now, thank you. The PCC diff shows another `helo_data = lists.wikimedia.org`.
I'm not sure if that needs to be updated as wel" [puppet] - https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[08:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:35:26] jmm@cumin2002 makevm (PID 1420291) is awaiting input
[08:35:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4004.ulsfo.wmnet with OS bookworm
[08:36:58] (PS3) Ayounsi: sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - https://gerrit.wikimedia.org/r/1249971
[08:37:16] (CR) Ayounsi: sre.hosts.provision: add safeguard for typoes in serials (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1249971 (owner: Ayounsi)
[08:39:30] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003"
[08:40:23] (CR) Tiziano Fogli: [C:+2] alertmanager/o11y: add route to handle alerts with severity=task [puppet] - https://gerrit.wikimedia.org/r/1249349 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[08:40:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003"
[08:40:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1014.eqiad.wmnet with OS trixie
[08:41:12] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1015.eqiad.wmnet with OS trixie
[08:41:22] (CR) Ayounsi: [C:+1] "change lgtm but I don't have the authority to put those hosts to prod" [puppet] - https://gerrit.wikimedia.org/r/1250506
(https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [08:43:13] (03CR) 10Elukey: [C:03+1] sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 (owner: 10Ayounsi) [08:44:15] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:45:03] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on wdqs1025:9290 - https://phabricator.wikimedia.org/T419664 (10phaultfinder) 03NEW [08:48:39] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696297 (10MatthewVernon) Ah, I just put `10:00 EST` into `date`. 
You're probably right, but a confirmation would be helpful :) [08:52:52] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1015.eqiad.wmnet with reason: host reimage [08:54:02] !log installing imagemagick security updates [08:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:39] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: mysql upgrade / restart [08:56:31] (03PS2) 10Kgraessle: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) [08:56:43] (03PS3) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249959 (https://phabricator.wikimedia.org/T419258) [08:57:34] (03CR) 10Vgutierrez: Add ncredir4003/ncredir4004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [08:58:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4004.ulsfo.wmnet with reason: host reimage [08:58:55] !log trueg@deploy2002 helmfile [staging] START helmfile.d/services/SERVICE_NAME: apply [08:58:57] !log trueg@deploy2002 helmfile [staging] DONE helmfile.d/services/SERVICE_NAME: apply [08:59:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1015.eqiad.wmnet with reason: host reimage [09:00:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249217 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [09:01:04] (03Merged) 10jenkins-bot: stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249217 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [09:01:28] (03CR) 
10Muehlenhoff: Add ncredir4003/ncredir4004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:01:33] !log javiermonton@deploy2002 Started scap sync-world: Backport for [[gerrit:1249217|stream: mediawiki.page_html_content_change (T419258)]] [09:01:35] (03CR) 10Tiziano Fogli: [C:03+2] prometheus: add cardinality explosion alerts [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [09:01:37] T419258: Adatp HTML pipeline to the new diffs schema - https://phabricator.wikimedia.org/T419258 [09:02:34] (03PS2) 10Trueg: deployment_server: Add wdqs-queryhammer service [puppet] - 10https://gerrit.wikimedia.org/r/1249918 (https://phabricator.wikimedia.org/T417415) [09:03:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4004.ulsfo.wmnet with reason: host reimage [09:03:36] !log javiermonton@deploy2002 javiermonton: Backport for [[gerrit:1249217|stream: mediawiki.page_html_content_change (T419258)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:03:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2095.codfw.wmnet with OS bullseye [09:03:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696322 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye [09:05:27] (03PS4) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [09:06:04] !log javiermonton@deploy2002 javiermonton: Continuing with sync [09:07:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2096.codfw.wmnet with OS bullseye [09:08:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2096.codfw.wmnet with OS bullseye [09:08:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696334 (10MatthewVernon) Hi @Jhancock.wm I'm afraid this is the problem we've seen with Dell before (but that I hoped they were going to correct), where they send us sy... 
[09:10:01] !log javiermonton@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249217|stream: mediawiki.page_html_content_change (T419258)]] (duration: 08m 28s) [09:10:05] T419258: Adatp HTML pipeline to the new diffs schema - https://phabricator.wikimedia.org/T419258 [09:11:16] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249959 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [09:12:01] (03PS1) 10Muehlenhoff: ncredir: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1250517 [09:12:01] (03PS1) 10Gkyziridis: ml-services: Deploy new version of edit-check in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250518 (https://phabricator.wikimedia.org/T419527) [09:13:12] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249959 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [09:13:40] (03CR) 10Muehlenhoff: Add ncredir4003/ncredir4004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:14:30] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:15:13] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:15:16] (03PS1) 10Muehlenhoff: ncredir4003/4004: Change back to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1250520 (https://phabricator.wikimedia.org/T418993) [09:15:22] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:17:22] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy new version of edit-check in staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1250518 (https://phabricator.wikimedia.org/T419527) (owner: 10Gkyziridis) [09:17:35] elukey@cumin1003 reimage (PID 2828195) is awaiting input [09:18:46] (03PS4) 10Ayounsi: sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 [09:19:29] (03Merged) 10jenkins-bot: ml-services: Deploy new version of edit-check in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250518 (https://phabricator.wikimedia.org/T419527) (owner: 10Gkyziridis) [09:19:54] (03PS5) 10Ayounsi: sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 [09:19:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4004.ulsfo.wmnet with OS bookworm [09:19:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir4004.ulsfo.wmnet [09:21:00] (03Abandoned) 10Kgraessle: Enable rr-ml AutoModerator CC form on !large wikis Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [09:21:52] (03CR) 10Vgutierrez: [C:03+1] ncredir4003/4004: Change back to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1250520 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:22:11] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[09:22:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1249522 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [09:22:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1250517 (owner: 10Muehlenhoff) [09:23:09] (03PS2) 10Muehlenhoff: ncredir: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1250517 [09:24:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2095.codfw.wmnet with reason: host reimage [09:24:12] (03CR) 10Arnaudb: "this is weird, we have 2 templates in that dir:" [puppet] - 10https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [09:25:12] (03PS6) 10Ayounsi: sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 [09:25:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:26:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:27:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:27:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2095.codfw.wmnet with reason: host reimage [09:28:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2096.codfw.wmnet with reason: host reimage [09:28:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [09:29:10] (03PS4) 10JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) [09:29:18] (03CR) 
10JMeybohm: [C:03+2] kubernetes: Don't re-define default admission_plugins [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:29:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [09:30:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [09:30:05] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 (owner: 10Ayounsi) [09:30:20] (03CR) 10JMeybohm: [C:03+2] Remove istio 1.15 wikikube config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248822 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:30:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [09:31:30] (03PS7) 10Ayounsi: sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 [09:31:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [09:32:04] (03Merged) 10jenkins-bot: Remove istio 1.15 wikikube config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248822 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:32:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2096.codfw.wmnet with reason: host reimage [09:32:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [09:32:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [09:32:56] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250523 (https://phabricator.wikimedia.org/T419258) [09:32:58] 06SRE, 06Infrastructure-Foundations, 
10netops, 06ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696410 (10MoritzMuehlenhoff) [09:33:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [09:34:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [09:35:00] (03PS1) 10JMeybohm: Fix PodSecurityPolicy related comments [puppet] - 10https://gerrit.wikimedia.org/r/1250524 (https://phabricator.wikimedia.org/T273507) [09:35:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [09:35:08] (03CR) 10Ayounsi: [C:03+1] Add netflow4003 [puppet] - 10https://gerrit.wikimedia.org/r/1250499 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:35:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [09:35:48] (03CR) 10Muehlenhoff: [C:03+2] Add netflow4003 [puppet] - 10https://gerrit.wikimedia.org/r/1250499 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:35:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [09:36:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [09:37:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [09:37:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [09:37:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [09:38:26] (03CR) 10Ayounsi: [C:03+1] Update netflow collector for ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1250505 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:39:22] (03CR) 10Arnaudb: [C:03+2] mailman: update helo data to use lists1004.w.o 
[puppet] - 10https://gerrit.wikimedia.org/r/1250433 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [09:41:48] (03CR) 10Gmodena: [C:03+1] deployment_server: Add wdqs-queryhammer service [puppet] - 10https://gerrit.wikimedia.org/r/1249918 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [09:46:28] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:47:51] (03CR) 10Blake: [C:03+2] switchdc: update set-readonly comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) (owner: 10Blake) [09:49:33] mvernon@cumin2002 reimage (PID 1430451) is awaiting input [09:51:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:51:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2095.codfw.wmnet with OS bullseye [09:51:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696480 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye completed: - ms-be2095 (... 
[09:51:48] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:52:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:52:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2096.codfw.wmnet with OS bullseye [09:52:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2096.codfw.wmnet with OS bullseye completed: - ms-be2096 (... [09:53:07] (03Merged) 10jenkins-bot: switchdc: update set-readonly comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) (owner: 10Blake) [09:53:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11696491 (10BTullis) >>! In T390734#11687686, @Ben.buchenau wrote: > Hello guys - follow-up request regarding Kerebos authentication: Can I get... [09:59:39] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2096.codfw.wmnet with OS bullseye [09:59:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696500 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2096.codfw.wmnet with OS bullseye completed: - ms-be2096... 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1000) [10:01:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696506 (10MatthewVernon) Imaging of both systems was OK once the relevant disk got wiped. [10:01:30] (03CR) 10Elukey: [C:03+1] sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 (owner: 10Ayounsi) [10:01:47] !log elukey@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [10:01:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1015.eqiad.wmnet with OS trixie [10:01:58] (03CR) 10Ayounsi: [C:03+2] sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 (owner: 10Ayounsi) [10:02:55] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Failed step after ml-serve1015's reimage - elukey@cumin1003" [10:02:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Failed step after ml-serve1015's reimage - elukey@cumin1003" [10:03:16] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250523 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [10:04:09] (03CR) 10Clément Goubert: [C:03+1] Add Chart.yaml metadata for ServiceOps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [10:05:11] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250523 
(https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [10:06:16] 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11696531 (10Clement_Goubert) Ack, thanks for following up. [10:06:53] (03CR) 10Clément Goubert: [C:03+1] api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249259 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [10:07:28] (03Merged) 10jenkins-bot: sre.hosts.provision: add safeguard for typoes in serials [cookbooks] - 10https://gerrit.wikimedia.org/r/1249971 (owner: 10Ayounsi) [10:08:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2095.codfw.wmnet with OS bullseye [10:08:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11696536 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2095.codfw.wmnet with OS bullseye completed: - ms-be2095... [10:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:17:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696558 (10taavi) [10:23:06] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11696564 (10ayounsi) 05Open→03Resolved All done. 
[10:23:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2071.codfw.wmnet with OS bullseye [10:23:46] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11696570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye [10:24:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2071 [10:26:14] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [10:30:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696597 (10BTullis) [10:31:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696605 (10BTullis) [10:31:53] mvernon@cumin2002 reimage (PID 1451721) is awaiting input [10:34:39] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2071 - mvernon@cumin2002" [10:34:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2071 - mvernon@cumin2002" [10:34:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:46] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2071.codfw.wmnet 221.16.192.10.in-addr.arpa 1.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:34:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2071.codfw.wmnet 221.16.192.10.in-addr.arpa 1.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:34:51] !log mvernon@cumin2002 
START - Cookbook sre.network.configure-switch-interfaces for host ms-be2071 [10:35:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2071 [10:35:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2071 [10:36:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11696619 (10elukey) Hosts provisioned and reimaged to Trixie :) @klausman @DPogorzelski-WMF - the host are now running with a base role that should al... [10:38:00] (03CR) 10Tiziano Fogli: [C:03+2] prom4003: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1249910 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [10:38:13] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since the operations have already been tested on other instances." [puppet] - 10https://gerrit.wikimedia.org/r/1249910 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [10:38:31] (03CR) 10Tiziano Fogli: [C:03+2] prom4003: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1249911 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [10:38:36] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since the operations have already been tested on other instances." [puppet] - 10https://gerrit.wikimedia.org/r/1249911 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [10:38:55] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since the operations have already been tested on other instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/1249912 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [10:40:08] (03PS1) 10Mvolz: Update zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250534 [10:45:40] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11696664 (10ayounsi) [10:46:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11696665 (10BTullis) Will all of the switches in rows C & D be getting this configuration change? I'm asking because I've got another host that is exhibiting a reimage... [10:46:24] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11696666 (10ayounsi) I added rough network numbers. [10:53:44] (03CR) 10Milimetric: Add stream config for attribution research (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250249 (https://phabricator.wikimedia.org/T417050) (owner: 10TChin) [10:54:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2071.codfw.wmnet with reason: host reimage [10:58:19] (03CR) 10Effie Mouzeli: [C:03+2] api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249259 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [10:58:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2071.codfw.wmnet with reason: host reimage [10:59:58] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:00:05] mvolz: OwO what's this, a deployment window?? 
Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1100). nyaa~ [11:00:31] (03Merged) 10jenkins-bot: api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249259 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:02:37] (03CR) 10Mvolz: [C:03+2] Update zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250534 (owner: 10Mvolz) [11:02:37] (03CR) 10Effie Mouzeli: [C:03+2] Add Chart.yaml metadata for ServiceOps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:05:04] (03Merged) 10jenkins-bot: Update zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250534 (owner: 10Mvolz) [11:05:08] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:05:19] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:06:06] (03PS1) 10Urbanecm: [Growth] kaiwiki: Enable GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250539 (https://phabricator.wikimedia.org/T304052) [11:06:56] (03PS6) 10Urbanecm: [Growth] Enable on every new Wikipedia by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) [11:08:10] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:08:32] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:08:56] Mvolz: any objections to me making a MW deployment, or should i wait? [11:09:36] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since the operations have already been tested on other instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/1249913 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [11:10:13] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:10:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11696788 (10BTullis) Thanks @VRiley-WMF - You can replace this at any time. This drive is a member of a hardware RAID10 volume, so we're not going to lose... [11:10:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11696790 (10BTullis) p:05Triage→03Low [11:10:41] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:10:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11696791 (10BTullis) [11:12:13] (03CR) 10Phuedx: [C:03+1] Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [11:12:15] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11696804 (10BTullis) Hi @Jclark-ctr - Feel free to replace the drive whenever is convenient. 
[11:12:51] (03Merged) 10jenkins-bot: Add Chart.yaml metadata for ServiceOps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:13:39] RESOLVED: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:14:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11696810 (10BTullis) @Jclark-ctr - Please feel free to swap the drive at any time. I'm not seeing anything reported by slot 10, but I'll check ag... [11:17:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250539 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [11:18:19] (03PS1) 10Mvolz: Revert "Update zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250542 [11:18:30] (03CR) 10Mvolz: [C:03+2] Revert "Update zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250542 (owner: 10Mvolz) [11:18:48] !log urbanecm@deploy2002 mwscript-k8s job started: WikimediaMaintenance:createExtensionTables.php --wiki=kaiwiki growthexperiments # T304052 [11:18:53] T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052 [11:19:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2071.codfw.wmnet with OS bullseye [11:19:16] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11696852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2071.codfw.wmnet with OS 
bullseye compl... [11:19:20] (03Merged) 10jenkins-bot: [Growth] kaiwiki: Enable GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250539 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [11:19:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [11:19:42] (03CR) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:19:47] /43/31 [11:19:50] ugh [11:19:51] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1250539|[Growth] kaiwiki: Enable GrowthExperiments (T304052)]] [11:20:26] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/ulsfo: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1249915 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [11:20:36] (03Merged) 10jenkins-bot: Revert "Update zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250542 (owner: 10Mvolz) [11:21:14] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:21:27] !log tappof@dns1004 START - running authdns-update [11:21:41] (03PS4) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) [11:21:43] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:21:48] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:21:56] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1250539|[Growth] kaiwiki: Enable GrowthExperiments (T304052)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:22:07] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:22:19] (03CR) 10Jdlrobson: Enable personal main menu to all users in Minerva Neue skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [11:22:49] !log tappof@dns1004 END - running authdns-update [11:23:08] (03PS5) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) [11:23:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2072.codfw.wmnet with OS bullseye [11:23:55] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11696893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye [11:24:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2072 [11:24:05] (03PS6) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) [11:24:22] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:24:58] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:26:21] !log urbanecm@deploy2002 mwscript-k8s job started: WikimediaMaintenance:createExtensionTables.php --wiki=kaiwiki growthexperiments # T304052 [11:26:25] T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052 [11:26:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [11:27:53] (03CR) 10Clément Goubert: [C:03+2] wmnet: Add api-gateway-ro record [dns] - 10https://gerrit.wikimedia.org/r/1244697 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:28:21] !log cgoubert@dns1004 START - running authdns-update [11:28:31] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2072 - mvernon@cumin2002" [11:28:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2072 - mvernon@cumin2002" [11:28:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:28:37] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2072.codfw.wmnet 158.32.192.10.in-addr.arpa 8.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:28:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2072.codfw.wmnet 158.32.192.10.in-addr.arpa 8.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:28:41] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2072 [11:29:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2072 [11:29:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2072 [11:29:40] (03PS5) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 
(https://phabricator.wikimedia.org/T411485) [11:29:44] !log cgoubert@dns1004 END - running authdns-update [11:30:03] !log urbanecm@deploy2002 urbanecm: Continuing with sync [11:30:20] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:30:39] (03CR) 10Urbanecm: [C:03+2] [Growth] Enable on every new Wikipedia by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [11:30:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2073.codfw.wmnet with OS bullseye [11:30:52] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11696929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye [11:30:56] (03PS1) 10Gkyziridis: ml-services: Deploy latest version of edit-check model on production. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1250546 (https://phabricator.wikimedia.org/T419527) [11:31:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2073 [11:31:22] (03PS1) 10Muehlenhoff: Set netflow4003 as nftables [puppet] - 10https://gerrit.wikimedia.org/r/1250547 [11:31:34] (03Merged) 10jenkins-bot: [Growth] Enable on every new Wikipedia by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [11:32:07] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:32:28] (03Merged) 10jenkins-bot: api-gateway: Add api-gateway-ro to certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:32:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:33:22] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:33:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:34:02] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250539|[Growth] kaiwiki: Enable GrowthExperiments (T304052)]] (duration: 14m 11s) [11:34:06] T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052 [11:34:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:34:39] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1239954|[Growth] Enable on every new Wikipedia by default (T304052)]] [11:34:58] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:34:59] !log cgoubert@deploy2002 helmfile [codfw] START 
helmfile.d/services/api-gateway: apply [11:35:28] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:36:16] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2073 - mvernon@cumin2002" [11:36:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2073 - mvernon@cumin2002" [11:36:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:36:22] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2073.codfw.wmnet 212.48.192.10.in-addr.arpa 2.1.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:36:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2073.codfw.wmnet 212.48.192.10.in-addr.arpa 2.1.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:36:26] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2073 [11:36:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11696957 (10Ben.buchenau) Hi Ben & Andrea, Thanks for sharing the info! I implemented a DAG last year together with Andrew McAllister for vide... [11:36:46] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1239954|[Growth] Enable on every new Wikipedia by default (T304052)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:37:16] !log urbanecm@deploy2002 urbanecm: Continuing with sync [11:37:55] !log upgrading to acme-chief 0.39 on acme-chief production instances - T419352 [11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:00] T419352: acme-chief is unable to validate challenges against GTS staging environment - https://phabricator.wikimedia.org/T419352 [11:38:04] (03CR) 10Muehlenhoff: [C:03+2] Set netflow4003 as nftables [puppet] - 10https://gerrit.wikimedia.org/r/1250547 (owner: 10Muehlenhoff) [11:38:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2073 [11:38:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2073 [11:41:18] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1239954|[Growth] Enable on every new Wikipedia by default (T304052)]] (duration: 06m 39s) [11:41:22] T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052 [11:42:01] urbanecm: sorry i missed it, go ahead, I'm all done [11:42:17] Mvolz: no worries, thought so. i'm just finishing as well. ty! 
[11:43:39] RESOLVED: JobUnavailable: Reduced availability for job thanos-sidecar in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:48:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2072.codfw.wmnet with reason: host reimage [11:48:39] FIRING: JobUnavailable: Reduced availability for job thanos-sidecar in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:22] (03CR) 10Muehlenhoff: [C:03+2] ncredir4003/4004: Change back to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1250520 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [11:52:05] PROBLEM - Host ms-be2072 is DOWN: PING CRITICAL - Packet loss = 100% [11:52:05] (03PS1) 10Anne Tomasevich: Revert "Enable personal main menu to all users in Minerva Neue skin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250558 [11:52:44] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11697057 (10taavi) Should the `policy: local_command` option have a separate setting for a command for re-pooling the node? 
[11:52:58] (03CR) 10Vgutierrez: trafficserver: Support fractional routing for api.w.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:53:39] jmm@cumin2002 reimage (PID 1472304) is awaiting input [11:54:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2072.codfw.wmnet with reason: host reimage [11:57:07] RECOVERY - Host ms-be2072 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [11:58:00] (03PS2) 10Jdlrobson: Revert "Enable personal main menu to all users in Minerva Neue skin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250558 (https://phabricator.wikimedia.org/T413912) (owner: 10Anne Tomasevich) [11:58:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2073.codfw.wmnet with reason: host reimage [11:59:43] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1010 [12:00:31] (03PS1) 10Mvolz: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) [12:01:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1010 [12:01:55] PROBLEM - Host ms-be2073 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:16] (03PS2) 10Mvolz: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) [12:03:52] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on wdqs1025:9290 - https://phabricator.wikimedia.org/T419664#11697093 (10Jclark-ctr) a:03Jclark-ctr [12:04:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4003.ulsfo.wmnet with OS bookworm [12:05:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - 
https://phabricator.wikimedia.org/T418993#11697102 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir4003.ulsfo.wmnet with OS bookworm [12:05:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2073.codfw.wmnet with reason: host reimage [12:06:15] (03PS3) 10Mvolz: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) [12:06:57] RECOVERY - Host ms-be2073 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [12:10:42] (03CR) 10Muehlenhoff: [C:03+2] Update netflow collector for ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1250505 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [12:11:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4003.ulsfo.wmnet [12:14:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2072.codfw.wmnet with OS bullseye [12:15:03] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11697115 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye compl... 
[12:17:06] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1011 [12:17:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1011 [12:18:33] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [12:18:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4003.ulsfo.wmnet [12:21:10] (03PS4) 10Mvolz: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) [12:22:01] (03PS5) 10Mvolz: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) [12:23:54] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [12:24:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2073.codfw.wmnet with OS bullseye [12:24:31] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11697142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye compl... 
[12:26:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1250517 (owner: 10Muehlenhoff) [12:28:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [12:28:28] (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: enable rsyncd on ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1250565 [12:28:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4003.ulsfo.wmnet with reason: host reimage [12:28:38] (03PS2) 10Muehlenhoff: Add ncredir4003/ncredir4004 [puppet] - 10https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993) [12:28:41] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: enable rsyncd on ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1250565 (owner: 10Tiziano Fogli) [12:30:45] (03PS3) 10Jdlrobson: Restore advanced main menu for AMC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250566 (https://phabricator.wikimedia.org/T413912) [12:30:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250566 (https://phabricator.wikimedia.org/T413912) (owner: 10Jdlrobson) [12:31:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11697150 (10Jclark-ctr) Drive has been replaced Currently rebuilding [12:32:53] (03PS1) 10Tiziano Fogli: prom4003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1250567 (https://phabricator.wikimedia.org/T419430) [12:34:01] (03CR) 10Tiziano Fogli: [C:03+2] prom4003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1250567 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [12:34:29] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4003.ulsfo.wmnet with reason: host reimage [12:35:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11697156 (10Jclark-ctr) @BTullis Failed Drive has been replaced [12:35:48] !log completed migration from prometheus4002 to prometheus4003 (ulsfo) (T419430) [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum4003.ulsfo.wmnet [12:36:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:37:14] !log installing inetutils security updates [12:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum4003.ulsfo.wmnet - jmm@cumin2002" [12:42:22] (03PS1) 10Jdlrobson: Fix pinnableElement export [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250568 (https://phabricator.wikimedia.org/T419620) [12:44:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250568 (https://phabricator.wikimedia.org/T419620) (owner: 10Jdlrobson) [12:44:45] jmm@cumin2002 makevm (PID 1486798) is awaiting input [12:44:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:45:53] !log jmm@cumin2002 
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum4003.ulsfo.wmnet - jmm@cumin2002" [12:45:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:45:54] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum4003.ulsfo.wmnet on all recursors [12:45:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum4003.ulsfo.wmnet on all recursors [12:46:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum4003.ulsfo.wmnet - jmm@cumin2002" [12:46:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum4003.ulsfo.wmnet - jmm@cumin2002" [12:46:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum4003.ulsfo.wmnet with OS trixie [12:48:03] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1205 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [12:49:58] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4003.ulsfo.wmnet with OS bookworm [12:51:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11697186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir4003.ulsfo.wmnet 
with OS bookworm completed: - ncredi... [12:54:00] (03CR) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [12:54:20] (03CR) 10Majavah: [C:03+2] P:openldap_clouddev: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1249995 (owner: 10Majavah) [12:57:02] (03PS1) 10Jelto: gitlab: de-duplicate active_host checks [puppet] - 10https://gerrit.wikimedia.org/r/1250571 [12:57:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4004.ulsfo.wmnet with OS bookworm [12:57:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11697201 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir4004.ulsfo.wmnet with OS bookworm [12:58:08] (03PS1) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1300). [13:00:05] sfaci and jdlrobson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:29] (03CR) 10CI reject: [V:04-1] profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:00:35] !log installing libcommons-lang3-java security updates [13:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:02] o/ [13:03:05] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8250/console" [puppet] - 10https://gerrit.wikimedia.org/r/1250571 (owner: 10Jelto) [13:03:14] (03CR) 10Muehlenhoff: phab_deploy_finalize: Remove support for Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248466 (owner: 10Muehlenhoff) [13:03:16] (03CR) 10Muehlenhoff: [C:03+2] phab_deploy_finalize: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1248466 (owner: 10Muehlenhoff) [13:03:30] o/ [13:03:39] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:24] sfaci: do you want to deploy your change or shall I do it? [13:05:18] Can you do it please? I can't deploy [13:05:23] Thank you! [13:05:25] (03PS2) 10Elukey: profile::logstash: drop kserve-controller's logs [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) [13:05:32] sfaci: sure thing! [13:05:36] Thanks!!!! [13:05:55] (03CR) 10Vgutierrez: [C:03+1] Add ncredir4003/ncredir4004 [puppet] - 10https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [13:06:04] sfaci: just https://gerrit.wikimedia.org/r/c/1247547/ right? [13:07:03] Yes! 
[13:07:04] sfaci: if so I can do yours first [13:07:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [13:07:24] (03CR) 10Elukey: "forgot to add the tests, but samples are here https://logstash.wikimedia.org/goto/182fff55e1474600719ddc8d63d42647 if anybody has time" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:07:24] cool [13:08:06] (03Merged) 10jenkins-bot: Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [13:08:41] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1247547|Remove `MetricsPlatform` configuration from production (T416865)]] [13:08:45] T416865: Remove references to MetricsPlatform extension - https://phabricator.wikimedia.org/T416865 [13:09:24] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [13:13:10] (03PS1) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha checks on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) [13:13:32] (03CR) 10CI reject: [V:04-1] hcaptcha: Enforce hCaptcha checks on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [13:13:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4003.ulsfo.wmnet with reason: host reimage [13:14:10] (03PS2) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) [13:14:17] (03PS1) 10Ssingh: wmflib/dnsrecursor: add function for fetching auth nameservers addresses [puppet] - 10https://gerrit.wikimedia.org/r/1250576 [13:14:23] (03CR) 10CI reject: [V:04-1] hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [13:15:37] (03CR) 10JMeybohm: [C:04-1] "SRELBBatchRunner will also take care of (re-)pooling nodes after the action has completed. AIUI this will no longer happen with this chang" [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 10Blake) [13:15:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8251/co" [puppet] - 10https://gerrit.wikimedia.org/r/1250576 (owner: 10Ssingh) [13:17:27] (03PS3) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) [13:18:14] (03CR) 10CI reject: [V:04-1] hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [13:18:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4004.ulsfo.wmnet with reason: host reimage [13:18:39] (03PS4) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) [13:18:39] sfaci: still building container images.. 
[13:18:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4003.ulsfo.wmnet with reason: host reimage [13:18:55] (03CR) 10CI reject: [V:04-1] hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [13:19:12] ok! [13:20:04] (03PS2) 10Ssingh: wmflib/dnsrecursor: add function for fetching auth nameservers addresses [puppet] - 10https://gerrit.wikimedia.org/r/1250576 [13:21:27] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8252/co" [puppet] - 10https://gerrit.wikimedia.org/r/1250576 (owner: 10Ssingh) [13:22:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4004.ulsfo.wmnet with reason: host reimage [13:28:38] (03CR) 10Ssingh: hcaptcha: Enable nginx caching for secure-api.js (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan) [13:28:40] (03PS1) 10Anzx: urwikisource: add logo, sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250579 (https://phabricator.wikimedia.org/T415974) [13:29:28] (03PS1) 10Jsn.sherman: riskyArticleEdits: show page descriptions [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250581 (https://phabricator.wikimedia.org/T419442) [13:29:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250581 (https://phabricator.wikimedia.org/T419442) (owner: 10Jsn.sherman) [13:29:49] !log jdlrobson@deploy2002 jdlrobson, sfaci: Backport for 
[[gerrit:1247547|Remove `MetricsPlatform` configuration from production (T416865)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:29:52] T416865: Remove references to MetricsPlatform extension - https://phabricator.wikimedia.org/T416865 [13:30:12] sfaci: please test on debug servers and let me know when to proceed with deployment. [13:30:22] (03PS1) 10Jsn.sherman: Fix Instrumentation on mobile view [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250582 (https://phabricator.wikimedia.org/T419517) [13:30:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250582 (https://phabricator.wikimedia.org/T419517) (owner: 10Jsn.sherman) [13:30:41] Jdlrobson: there is nothing to test. The extension was already disabled and we are just removing unused configuration. You can proceed with deployment [13:30:52] !log jdlrobson@deploy2002 jdlrobson, sfaci: Continuing with sync [13:31:08] (03CR) 10Phuedx: [C:03+1] Remove mpic redirects to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1250250 (https://phabricator.wikimedia.org/T415845) (owner: 10Clare Ming) [13:31:56] Jdlrobson: i have a config to be deployed https://gerrit.wikimedia.org/r/1250579 , if you can deploy it i will add it to calendar [13:32:13] (03CR) 10Klausman: [C:03+1] "I am not sure there is an easy way to test for absence of messages, but I'll have a look." 
[puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:35:03] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11697389 (10ayounsi) yeah it's planned with `profile::server_pool` (and the same keys), focusing on the depool for now, especially for the `show` command. [13:35:27] (03CR) 10Arnaudb: [C:03+1] "the active/passive split is much easier to follow now, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1250571 (owner: 10Jelto) [13:35:55] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:36:10] (03PS3) 10Kosta Harlan: hcaptcha: Enable nginx caching for secure-api.js [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) [13:36:15] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:36:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4003.ulsfo.wmnet with OS trixie [13:36:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum4003.ulsfo.wmnet [13:36:21] (03CR) 10Kosta Harlan: hcaptcha: Enable nginx caching for secure-api.js (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan) [13:36:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum4004.ulsfo.wmnet [13:36:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:36:44] anzx: for some reason deployments are very slow today [13:36:50] 40m to do a config change. I have 2 changes to go. [13:37:01] I don't think there's going to be time. [13:37:28] Jdlrobson: ok i can schedule it for next window [13:37:33] :( sorry about that! 
[13:37:34] btullis@cumin1003 reimage (PID 2854388) is awaiting input [13:39:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4004.ulsfo.wmnet with OS bookworm [13:39:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11697422 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir4004.ulsfo.wmnet with OS bookworm completed: - ncredi... [13:39:46] (03CR) 10Nikerabbit: machinetranslation: Optimize model loading and memory footprints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) (owner: 10KartikMistry) [13:41:13] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11697433 (10Jclark-ctr) Drive has been Replaced. thank you! [13:41:25] (03PS1) 10Brouberol: kafka-mirrormaker: only deploy jumbo-eqiad->test-eqiad to eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250586 (https://phabricator.wikimedia.org/T417407) [13:41:28] (03PS1) 10Brouberol: kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250587 (https://phabricator.wikimedia.org/T417407) [13:42:11] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [13:42:23] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [13:42:24] jmm@cumin2002 makevm (PID 1510583) is awaiting input [13:43:23] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:44:19] (03CR) 10Elukey: "the tests are under 
profile logstash tests, see https://wikitech.wikimedia.org/wiki/Logstash#Writing_&_testing_filters. I forgot about tho" [puppet] - 10https://gerrit.wikimedia.org/r/1250573 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:44:33] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247547|Remove `MetricsPlatform` configuration from production (T416865)]] (duration: 35m 52s) [13:44:36] T416865: Remove references to MetricsPlatform extension - https://phabricator.wikimedia.org/T416865 [13:44:51] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:45:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250566 (https://phabricator.wikimedia.org/T413912) (owner: 10Jdlrobson) [13:45:40] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:45:40] (03PS1) 10MVernon: swift: add 5 codfw backends [puppet] - 10https://gerrit.wikimedia.org/r/1250591 (https://phabricator.wikimedia.org/T354872) [13:46:27] (03Merged) 10jenkins-bot: Restore advanced main menu for AMC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250566 (https://phabricator.wikimedia.org/T413912) (owner: 10Jdlrobson) [13:46:54] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1250566|Restore advanced main menu for AMC (T413912)]] [13:46:59] T413912: Deployment: Promote advanced user menu to all users - https://phabricator.wikimedia.org/T413912 [13:49:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum4004.ulsfo.wmnet - jmm@cumin2002" [13:49:19] !log depool cp7016 
[13:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1250566|Restore advanced main menu for AMC (T413912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:51:28] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [13:52:10] jmm@cumin2002 makevm (PID 1510583) is awaiting input [13:52:22] (03CR) 10Federico Ceratto: [C:03+1] "The hostnames in the yaml files matche the description and the 2 related tasks." [puppet] - 10https://gerrit.wikimedia.org/r/1250591 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [13:54:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum4004.ulsfo.wmnet - jmm@cumin2002" [13:54:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:03] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum4004.ulsfo.wmnet on all recursors [13:54:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum4004.ulsfo.wmnet on all recursors [13:54:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum4004.ulsfo.wmnet - jmm@cumin2002" [13:54:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum4004.ulsfo.wmnet - jmm@cumin2002" [13:54:42] !log repool cp7016 [13:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:55:40] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum4004.ulsfo.wmnet with OS trixie [13:56:05] Is the "Wikifunctions Services UTC Afternoon" being used today? I might need to run over since the last change is a train blocker [13:57:38] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250566|Restore advanced main menu for AMC (T413912)]] (duration: 10m 44s) [13:57:42] T413912: Deployment: Promote advanced user menu to all users - https://phabricator.wikimedia.org/T413912 [13:57:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250568 (https://phabricator.wikimedia.org/T419620) (owner: 10Jdlrobson) [13:57:56] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-02-28-010106 to 2026-03-10-214300 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250594 (https://phabricator.wikimedia.org/T416756) [13:58:01] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-04-220825 to 2026-03-10-224034 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250595 (https://phabricator.wikimedia.org/T327412) [13:58:04] !log uploaded libxml2 2.9.10+dfsg-6.7+deb11u9+wmf11u1 to component/php83-icu72 for bullseye-wikimedia (special build of libxml with ICU disabled to ensure co-installabiliy between icu 67 and icu 72) T419058 [13:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:08] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [13:58:37] (03CR) 10Effie Mouzeli: [C:03+1] Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) (owner: 10Mvolz) [13:59:44] (03Merged) 10jenkins-bot: Fix pinnableElement export [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250568 
(https://phabricator.wikimedia.org/T419620) (owner: 10Jdlrobson) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1400) [14:00:11] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-02-28-010106 to 2026-03-10-214300 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250594 (https://phabricator.wikimedia.org/T416756) (owner: 10Jforrester) [14:00:43] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1250568|Fix pinnableElement export (T419620)]] [14:00:47] T419620: [beta cluster] "TypeError: hasPinnedElementsFn is not a function" warning - https://phabricator.wikimedia.org/T419620 [14:00:54] (03CR) 10Ssingh: hcaptcha: Enable nginx caching for secure-api.js (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan) [14:02:24] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-02-28-010106 to 2026-03-10-214300 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250594 (https://phabricator.wikimedia.org/T416756) (owner: 10Jforrester) [14:02:30] (03PS1) 10Effie Mouzeli: Update chart metadata for various charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250597 (https://phabricator.wikimedia.org/T412693) [14:02:36] James_F: sorry for overrunning [14:02:46] Jdlrobson: No worries, we don't actually conflict. [14:02:47] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1250568|Fix pinnableElement export (T419620)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:02:56] k8s services are rather isolated from MW stuff. 
[14:02:58] ok cool im almost wrapped up [14:03:06] (03PS6) 10Mvolz: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) [14:03:07] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:03:13] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [14:03:44] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:04:09] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:04:31] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy latest version of edit-check model on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250546 (https://phabricator.wikimedia.org/T419527) (owner: 10Gkyziridis) [14:04:52] (03CR) 10KartikMistry: machinetranslation: Optimize model loading and memory footprints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) (owner: 10KartikMistry) [14:04:53] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:05:02] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:05:43] (03CR) 10Effie Mouzeli: [C:03+2] Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) (owner: 10Mvolz) [14:06:04] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:06:36] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-04-220825 to 2026-03-10-224034 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250595 (https://phabricator.wikimedia.org/T327412) (owner: 10Jforrester) [14:06:44] (03Merged) 10jenkins-bot: ml-services: Deploy latest version of edit-check model on production. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1250546 (https://phabricator.wikimedia.org/T419527) (owner: 10Gkyziridis) [14:07:09] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250568|Fix pinnableElement export (T419620)]] (duration: 06m 26s) [14:07:12] T419620: [beta cluster] "TypeError: hasPinnedElementsFn is not a function" warning - https://phabricator.wikimedia.org/T419620 [14:08:11] !log gkyziridis@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:08:16] (03Merged) 10jenkins-bot: Update chart metadata for zotero and citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250559 (https://phabricator.wikimedia.org/T412693) (owner: 10Mvolz) [14:08:23] !log gkyziridis@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:09:02] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-04-220825 to 2026-03-10-224034 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250595 (https://phabricator.wikimedia.org/T327412) (owner: 10Jforrester) [14:10:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11697599 (10Jclark-ctr) 05Open→03Resolved [14:10:36] (03PS1) 10Daniel Kinzler: rest-gateway rate limit: add DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 [14:10:49] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:54] (03PS1) 10Muehlenhoff: php8.3-icu72: Create new ICU 72 flavored image stack [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:10:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one not inline that we should doublecheck after the initial test image is built." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:11:09] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:11:16] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:11:26] (done) [14:11:28] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:11:46] (03CR) 10Scott French: "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1249522 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:11:56] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:12:28] (03CR) 10Scott French: [C:03+2] aptrepo: add pcre2 updates for component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1249522 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:12:32] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:12:41] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:12:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11697604 (10Jclark-ctr) Thanks @elukey for the help [14:12:54] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:12:58] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:13:01] (03CR) 10Muehlenhoff: [C:03+2] Add ncredir4003/ncredir4004 [puppet] - 10https://gerrit.wikimedia.org/r/1250506 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [14:13:27] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:13:30] 10ops-eqiad, 06SRE, 06DC-Ops, 
10Recommendation-API, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11697621 (10Jclark-ctr) [14:13:39] (03PS1) 10Scott French: php8.3-icu72: Clone php8.3 image stack [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249980 (https://phabricator.wikimedia.org/T419058) [14:13:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11697624 (10Jclark-ctr) 05Stalled→03Resolved [14:14:07] (03CR) 10Muehlenhoff: "(The ipip support is still TBD, but migrating one more firewall service on the way)" [puppet] - 10https://gerrit.wikimedia.org/r/1250517 (owner: 10Muehlenhoff) [14:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:14:59] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [14:15:58] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Wed 08 Apr 2026 01:39:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [14:18:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4004.ulsfo.wmnet with reason: host reimage [14:19:44] !log installing python-urllib3 security updates [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:24] (03CR) 10MVernon: [C:03+2] swift: add 5 codfw backends [puppet] - 10https://gerrit.wikimedia.org/r/1250591 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [14:22:20] (03CR) 10Scott French: "Yes, once we have all the packages ready, I'll build this locally before merging to confirm we get the expected results." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249981 (https://phabricator.wikimedia.org/T419058) (owner: 10Scott French) [14:22:23] (03PS1) 10Jelto: gerrit: fix failing discovery dns lookup in test spec [puppet] - 10https://gerrit.wikimedia.org/r/1250601 (https://phabricator.wikimedia.org/T411895) [14:23:05] (03PS2) 10Jelto: gerrit: fix failing discovery dns lookup in test spec [puppet] - 10https://gerrit.wikimedia.org/r/1250601 (https://phabricator.wikimedia.org/T411895) [14:23:11] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:23:13] (03CR) 10Michael Große: "I think this should now be ready to move forward, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [14:23:23] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:24:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4004.ulsfo.wmnet with reason: host reimage [14:24:25] (03PS6) 10Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:27:03] (03CR) 10Urbanecm: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [14:27:17] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir4003.ulsfo.wmnet [14:27:42] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir4003.ulsfo.wmnet [14:27:52] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8253/console" [puppet] - 10https://gerrit.wikimedia.org/r/1250601 
(https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [14:28:40] (03CR) 10Urbanecm: "Agreed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [14:29:25] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11697835 (10MatthewVernon) [14:29:49] (03CR) 10Jelto: [V:03+1] "`./utils/run_ci_locally.sh` fails because of the discovery dns lookup. I tried to mock just the dns lookup but failed. So this change defi" [puppet] - 10https://gerrit.wikimedia.org/r/1250601 (https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1430) [14:30:07] (03CR) 10CI reject: [V:04-1] sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 10Blake) [14:30:49] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir4004.ulsfo.wmnet [14:30:54] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir4004.ulsfo.wmnet [14:31:01] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir4001.ulsfo.wmnet [14:31:07] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir4002.ulsfo.wmnet [14:31:46] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11697857 (10ssingh) Hi folks. I confirmed with Valentin that we don't need the public IPs, `pybal-high-traffic1-ulsfo.wikimedia.org` and `pybal-high-tra... 
[14:32:24] (03CR) 10BBlack: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1250576 (owner: 10Ssingh) [14:33:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11697880 (10MoritzMuehlenhoff) [14:34:30] FIRING: LibericaDiffFPCheck: Liberica instance lvs4010:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [14:34:50] uh? [14:34:59] I'm guessing that's ncredir@ulsfo [14:35:03] ahh [14:35:14] definitely a new one for liberica :D [14:35:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service ncredir-httpslb6_443 has 1 unhealthy realservers pooled on lvs4008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:35:58] 2620:0:863:103:10:128:2:5 1 healthy: false | pooled: force-repool-failed [14:36:00] new instances have been pooled and old ones removed and IPv6 is down there [14:36:01] :/ [14:36:07] moritzm: ^^ [14:37:38] (03CR) 10Btullis: [C:03+2] deployment_server: Add wdqs-queryhammer service [puppet] - 10https://gerrit.wikimedia.org/r/1249918 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [14:38:26] ok.. I'm repooling the old instances and depooling the new ones [14:38:36] (03PS2) 10Herron: mwlog: remove notion of primary/secondary [puppet] - 10https://gerrit.wikimedia.org/r/1250014 (https://phabricator.wikimedia.org/T417002) [14:38:38] ok. worse case, we can depool ncredir in ulsfo as well [14:38:45] !log repool ncredir4001 && ncredir4002 [14:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:52] vgutierrez: ah, sorry. 
we'll investigate [14:39:30] FIRING: [2x] LibericaDiffFPCheck: Liberica instance lvs4008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [14:39:35] !log depool ncredir4003 && ncredir4004 [14:39:36] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11697929 (10ssingh) Sorry, @ayounsi reminded me that the main purpose of this task is to figure out what to do about the other public IPs. We will need... [14:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:28] (03PS1) 10Herron: mwlog: remove mwlog[12]002 from udp tee stream [puppet] - 10https://gerrit.wikimedia.org/r/1250606 (https://phabricator.wikimedia.org/T417002) [14:40:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service ncredir-httpslb6_443 has 1 unhealthy realservers pooled on lvs4008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:41:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4004.ulsfo.wmnet with OS trixie [14:41:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum4004.ulsfo.wmnet [14:42:04] what was the issue? [14:42:30] or still unknown? 
[14:43:03] jynus: service was left with only unhealthy realservers pooled [14:43:04] (03CR) 10Nikerabbit: machinetranslation: Optimize model loading and memory footprints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) (owner: 10KartikMistry) [14:43:14] (03CR) 10Herron: [C:03+2] mwlog: remove notion of primary/secondary [puppet] - 10https://gerrit.wikimedia.org/r/1250014 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [14:43:36] (03CR) 10Herron: [C:03+2] mwlog: remove mwlog[12]002 from udp tee stream [puppet] - 10https://gerrit.wikimedia.org/r/1250606 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [14:44:30] RESOLVED: [2x] LibericaDiffFPCheck: Liberica instance lvs4008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [14:45:07] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11697954 (10ssingh) @BCornwall: DC-Ops has recommended in the past to try rebooting the server again to see if the issue resolves. I am not saying it is the same but perhaps ca... 
[14:45:23] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [14:45:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [14:46:06] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:48:12] 06SRE, 10SRE-swift-storage, 10Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713 (10MatthewVernon) 03NEW [14:50:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11698029 (10MatthewVernon) 05Open→03Resolved @ayounsi I re-imaged with the `--move-vlan` argument 3 codfw nodes today, an... [14:51:18] gerrit awol? [14:52:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:50] there we go [14:53:12] (03CR) 10KartikMistry: machinetranslation: Optimize model loading and memory footprints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) (owner: 10KartikMistry) [14:53:25] !log updated component/php83-icu72 with libpcre2 10.42-1~wmf11+1 from apt-staging - T419058 [14:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:29] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [14:54:57] vgutierrez: what's not working for ncredir4003/4004 ? 
trying to troubleshot it, but I don't see any issue [14:55:19] XioNoX: IP6IP6 traffic [14:55:35] (03PS2) 10Effie Mouzeli: Update chart metadata for various charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250597 (https://phabricator.wikimedia.org/T412693) [14:55:39] FIRING: CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (2a02:ec80:300:fe09::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:56:09] vgutierrez: how can I test it? [14:56:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:57:18] (03CR) 10Elukey: [C:03+1] kafka-mirrormaker: only deploy jumbo-eqiad->test-eqiad to eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250586 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:57:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:58:55] (03PS1) 10MVernon: 
swift: add 4 new codfw frontends [puppet] - 10https://gerrit.wikimedia.org/r/1250609 (https://phabricator.wikimedia.org/T416243) [14:59:33] (03CR) 10Elukey: [C:03+1] kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250587 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:59:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11698105 (10MatthewVernon) Thanks :) [14:59:58] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:59:58] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:00:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:00:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:01:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:01:23] XioNoX: hmm give me a few minutes [15:01:27] thx [15:03:22] (03CR) 
10Brouberol: [C:03+2] kafka-mirrormaker: only deploy jumbo-eqiad->test-eqiad to eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250586 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:03:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:04:32] (03PS1) 10Muehlenhoff: Add durum4003/durum4004 as new nodes [puppet] - 10https://gerrit.wikimedia.org/r/1250613 (https://phabricator.wikimedia.org/T418993) [15:05:31] (03Merged) 10jenkins-bot: kafka-mirrormaker: only deploy jumbo-eqiad->test-eqiad to eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250586 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:05:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:08:25] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [15:08:36] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [15:15:32] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250587 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:17:30] (03Merged) 10jenkins-bot: kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250587 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:21:57] (03PS1) 10Ayounsi: network/data.yaml: update ulsfo network infra range 
[puppet] - 10https://gerrit.wikimedia.org/r/1250616 (https://phabricator.wikimedia.org/T408892) [15:26:31] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T419712 [15:30:00] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:30:00] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:30:00] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:30:31] XioNoX: ok.. I got it, sorry for the delay [15:30:36] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:30:54] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 30 Mar 2026 11:52:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:30:54] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 4.695 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:30:54] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 4.697 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:31:07] vgutierrez: no rush, what's up? 
[15:31:15] (03PS1) 10Brouberol: Revert "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250618 [15:31:26] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 30 Mar 2026 11:52:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:31:35] XioNoX: I mean I got how to check IP6IP6 [15:31:47] (03PS1) 10Thcipriani: Blubber: add python3-setuptools + use-system-packages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1250619 (https://phabricator.wikimedia.org/T418253) [15:31:56] vgutierrez: yeah, I'm all ears to help troubleshoot that issue [15:32:51] XioNoX: so I'm testing with https://gitlab.wikimedia.org/-/snippets/282 from bast6003 [15:33:39] looking good for ncredir4001 https://www.irccloud.com/pastebin/KLlr2QAI/ [15:34:12] failing for ncredir4003 https://www.irccloud.com/pastebin/ELt9LUrw/ [15:34:53] (03CR) 10Ssingh: [C:03+1] "Looks good thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1250613 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:35:49] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T419712 [15:36:23] vgutierrez: how can I reproduce? 
[15:36:32] btullis@cumin1003 reimage (PID 2871895) is awaiting input [15:36:41] I want to do some packet captures to see where the issue is [15:39:39] !log sudo cumin "C:dnsrecursor" "disable-puppet 'merging CR 1250576'" [15:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:09] (03PS1) 10Btullis: Enable UEFI for dse-k8s-worker1010 [puppet] - 10https://gerrit.wikimedia.org/r/1250620 (https://phabricator.wikimedia.org/T414787) [15:41:16] (03CR) 10Btullis: [C:03+2] Enable UEFI for dse-k8s-worker1010 [puppet] - 10https://gerrit.wikimedia.org/r/1250620 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [15:42:14] (03CR) 10CI reject: [V:04-1] Enable UEFI for dse-k8s-worker1010 [puppet] - 10https://gerrit.wikimedia.org/r/1250620 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [15:42:30] (03CR) 10Ssingh: [V:03+1 C:03+2] wmflib/dnsrecursor: add function for fetching auth nameservers addresses [puppet] - 10https://gerrit.wikimedia.org/r/1250576 (owner: 10Ssingh) [15:43:01] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T419712 [15:43:45] (03Abandoned) 10Btullis: Enable UEFI for dse-k8s-worker1010 [puppet] - 10https://gerrit.wikimedia.org/r/1250620 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [15:44:08] (03CR) 10Brouberol: [C:03+2] Revert "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250618 (owner: 10Brouberol) [15:45:28] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [15:45:36] (03PS5) 10Urbanecm: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) [15:45:55] (03Merged) 10jenkins-bot: Revert "kafka-mirrormaker: migrate 
logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250618 (owner: 10Brouberol) [15:46:02] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:46:30] PROBLEM - Host wikikube-worker2332 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:58] RECOVERY - Host wikikube-worker2332 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [15:48:39] FIRING: JobUnavailable: Reduced availability for job thanos-sidecar in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:45] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [15:48:50] !log sudo cumin -b1 -s10 "C:dnsrecursor" "run-puppet-agent --enable 'merging CR 1250576'" [15:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:23] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [15:50:07] btullis: do you have something to do with the uncommitted changes under /srv/deployment-charts on deploy2002? 
[15:50:16] never mind, they disappeared [15:50:28] btullis@cumin1003 provision (PID 2879009) is awaiting input [15:50:34] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:50:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:50:59] (03PS1) 10Majavah: dnsrecursor: Use a proper data type for forward zone data [puppet] - 10https://gerrit.wikimedia.org/r/1250626 [15:51:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:51:11] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2332:9290 - https://phabricator.wikimedia.org/T419462#11698379 (10Jhancock.wm) reseated and secured both power cables. alert cleared. [15:51:17] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2332:9290 - https://phabricator.wikimedia.org/T419462#11698380 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:51:28] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11698383 (10ssingh) @Jgreen / @Dwisehaupt: `donate-lb.ulsfo.wikimedia.org` is the same IP as `text-lb.ulsfo.wikimedia.org` and that will change as part... [15:51:35] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2333:9290 - https://phabricator.wikimedia.org/T419465#11698385 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated and secured both power cables. alert cleared. 
[15:51:57] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T419712 [15:52:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7011.magru.wmnet with OS trixie [15:53:51] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1250626 (owner: 10Majavah) [15:55:59] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11698428 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm i need to find a way to wipe it without root access. I'd like to able to fix that issue without pulli... [15:56:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11698432 (10Jhancock.wm) [15:57:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:59:14] btullis@cumin1003 provision (PID 2878936) is awaiting input [15:59:45] (03PS1) 10Brouberol: kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) [16:02:38] (03PS1) 10Santiago Faci: ext.wikimediaEvents: Updated Test Kitchen impact test experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250632 (https://phabricator.wikimedia.org/T407570) [16:05:25] (03PS8) 10Daniel Kinzler: rest-gateway rate limiting: add CORS support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) [16:12:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250632 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [16:12:34] (03PS5) 10Effie Mouzeli: ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) [16:13:07] (03PS5) 10Effie Mouzeli: eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 (https://phabricator.wikimedia.org/T412693) [16:13:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1250616 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [16:13:34] (03CR) 10Ayounsi: [C:03+2] network/data.yaml: update ulsfo network infra range [puppet] - 10https://gerrit.wikimedia.org/r/1250616 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [16:16:03] (03PS1) 10Daniel Kinzler: rest gateways: tests: check heathz first [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250637 [16:18:40] !log tappof@cumin1003 START - Cookbook sre.hosts.decommission for hosts prometheus4002.ulsfo.wmnet [16:19:45] (03Abandoned) 10Jdlrobson: Revert "Enable personal main menu to all users in Minerva Neue skin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250558 (https://phabricator.wikimedia.org/T413912) (owner: 10Anne Tomasevich) [16:19:54] (03CR) 10Muehlenhoff: [C:03+2] Add durum4003/durum4004 as new nodes [puppet] - 10https://gerrit.wikimedia.org/r/1250613 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [16:19:58] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:20:15] FIRING: MediaWikiMemcachedHighErrorRate: 
MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:20:51] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [16:21:10] (03PS1) 10Andrew Bogott: Replace cloudgw2002-dev with cloudgw2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1250638 (https://phabricator.wikimedia.org/T418765) [16:22:06] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1250638 (https://phabricator.wikimedia.org/T418765) (owner: 10Andrew Bogott) [16:23:02] (03Merged) 10jenkins-bot: ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [16:23:32] !log tappof@cumin1003 START - Cookbook sre.dns.netbox [16:24:22] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11698537 (10RLazarus) Service Ops triage here: Agreed there's nothing for us to do, thanks @ayounsi - untagging us. 
[16:24:43] (03CR) 10Effie Mouzeli: [C:03+2] eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [16:25:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:25:43] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [16:27:06] (03Merged) 10jenkins-bot: eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [16:28:39] FIRING: [3x] JobUnavailable: Reduced availability for job envoy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:16] tappof@cumin1003 decommission (PID 2882568) is awaiting input [16:29:52] btullis@cumin1003 provision (PID 2878936) is awaiting input [16:30:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:30:56] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7011.magru.wmnet with reason: host reimage [16:30:56] !log tappof@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1003" [16:32:18] !log tappof@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1003" [16:32:18] !log tappof@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:19] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus4002.ulsfo.wmnet [16:32:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:34:16] (03CR) 10Andrew Bogott: [C:03+2] Replace cloudgw2002-dev with cloudgw2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1250638 (https://phabricator.wikimedia.org/T418765) (owner: 10Andrew Bogott) [16:35:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11698609 (10BTullis) Thanks for the additional context @Ben.buchenau. I think that there might be a bit of a misunderstanding here: > I checked... 
[16:35:22] !log root@cumin2002 START - Cookbook sre.dns.netbox [16:35:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7011.magru.wmnet with reason: host reimage [16:36:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:37:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [16:38:25] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11698640 (10RobH) >>! In T418411#11696664, @ayounsi wrote: > I added rough network numbers. Thank you for this! I'll be filling o... [16:39:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:39:38] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11698643 (10RobH) [16:39:42] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [16:40:13] !log root@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moving many things from cloudgw2002-dev to cloudgw2004-dev - root@cumin2002" [16:40:19] !log root@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moving many things from cloudgw2002-dev to cloudgw2004-dev - root@cumin2002" [16:40:19] !log root@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [16:44:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [16:46:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:46:37] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11698703 (10RobH) Apologies this was neglected. Since we need to likely give 24 hours notice for smart hands to avoid expedite fees, I suggest we schedule this for... [16:47:18] PROBLEM - Bird Internet Routing Daemon on durum4003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:47:33] btullis@cumin1003 reimage (PID 2884938) is awaiting input [16:48:16] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11698721 (10RobH) [16:51:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:53:40] btullis@cumin1003 reimage (PID 2884938) is awaiting input [16:54:58] FIRING: [3x] JobUnavailable: Reduced availability for job envoy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:56:05] 10ops-magru: Inbound errors on interface 
cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11698762 (10RobH) Remote hands ticket CS1254900 filed: > Support, > > Please schedule this work for Monday 2026-03-16 @ 11AM Brazil time, as we'll drain the link... [16:57:25] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11698779 (10RobH) p:05Triage→03High [16:58:39] RESOLVED: [3x] JobUnavailable: Reduced availability for job envoy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:58:39] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on durum4003.ulsfo.wmnet with reason: in setup [16:58:52] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on durum4004.ulsfo.wmnet with reason: in setup [16:59:37] (03PS4) 10Scott French: mw-(api-int|web): Pilot drain configuration in canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250067 (https://phabricator.wikimedia.org/T364245) [17:00:04] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1700). [17:00:15] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [17:00:16] o/ [17:00:21] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11698790 (10RobH) a:05RobH→03ayounsi @ayounsi / @cmooney / @papaul, Not sure who wants to take point on this, but since I chatted briefly with Arzhel in IRC I'... 
[17:01:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [17:01:08] 10ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11698798 (10RobH) Ack >>! In T413409#11695960, @ayounsi wrote: > Rob, could you investigate those as well. Same as {T415743}. Please sync up wi... [17:01:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:02:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7011.magru.wmnet with OS trixie [17:04:05] (03CR) 10Scott French: [C:03+2] mw-(api-int|web): Pilot drain configuration in canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250067 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:06:10] (03Merged) 10jenkins-bot: mw-(api-int|web): Pilot drain configuration in canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250067 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:08:32] starting infra window work now [17:09:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:09:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:12:50] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:13:23] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:15:36] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [17:17:49] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - 
https://phabricator.wikimedia.org/T419611#11698841 (10RobH) a:03RobH [17:18:48] (03PS1) 10Btullis: Set dse-k8s-worker101[0-1] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1250645 (https://phabricator.wikimedia.org/T414787) [17:18:50] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11698846 (10RobH) Dell will require all the firmware and such be the latest versions before they call it a failure, so I'll steal this and update the firmware on this host and... [17:18:59] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:19:36] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:19:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [17:19:58] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:20:25] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:22:13] (03CR) 10Btullis: [C:03+2] Set dse-k8s-worker101[0-1] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1250645 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [17:23:00] btullis@cumin1003 reimage (PID 2888889) is awaiting input [17:27:17] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11698888 (10RobH) [17:27:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:28:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:28:48] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:29:20] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:31:52] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: 
sync [17:32:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [17:34:32] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [17:35:10] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [17:36:01] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:36:29] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:36:55] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:37:28] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:37:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:38:14] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:38:29] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:38:55] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:42:28] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:42:38] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:43:25] (03PS1) 10Bvibber: Revert "Fix for temp section open during slow loads on Parsoid" [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250647 (https://phabricator.wikimedia.org/T416063) [17:43:49] (03PS1) 10Bvibber: Revert "Fix for temp section open during slow loads on Parsoid" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250648 (https://phabricator.wikimedia.org/T416063) [17:45:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250647 (https://phabricator.wikimedia.org/T416063) (owner: 10Bvibber) [17:45:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250648 (https://phabricator.wikimedia.org/T416063) (owner: 10Bvibber) [17:45:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [17:46:13] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1011.eqiad.wmnet with OS bookworm [17:47:43] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1010.eqiad.wmnet with reason: host reimage [17:48:44] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [17:49:58] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:50:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:51:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:52:28] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:52:40] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:53:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:54:58] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:55:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1010.eqiad.wmnet with reason: host reimage [17:55:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:56:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:56:48] alright, I'm done with the infra window. 
I'll be leaving some changes live in the mw-api-int and mw-web canary deployments to evaluate how they behave during deployments throughout the day [17:57:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250579 (https://phabricator.wikimedia.org/T415974) (owner: 10Anzx) [17:58:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:59:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1011.eqiad.wmnet with reason: host reimage [17:59:58] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:00:05] brennen and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T1800). 
[18:00:23] o/ [18:00:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:01:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:03:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:05:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:06:25] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:11:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11699094 (10Jclark-ctr) @VRiley-WMF @BTullis We do not have any 4tb that match this server and would require purchasing the replacement drive [18:13:15] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" 
[18:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:16:20] btullis@cumin1003 reimage (PID 2892980) is awaiting input [18:16:29] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699110 (10BCornwall) Some commands I used to get info: `lang=bash $ sudo -i cumin 'ganeti7* or lvs7* or cp7* or dns7*' 'grep "mo... [18:16:55] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-backup1004.eqiad.wmnet with OS trixie [18:17:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11699121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie [18:17:28] (03PS4) 10AKhatun: stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) [18:18:52] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [18:18:58] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699130 (10BCornwall) [18:19:32] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699140 (10BCornwall) [18:19:42] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - 
https://phabricator.wikimedia.org/T418411#11699141 (10BCornwall) [18:20:32] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7011.* [18:21:56] btullis@cumin1003 reimage (PID 2893005) is awaiting input [18:23:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11699156 (10VRiley-WMF) @BTullis this is correct. While we have 4TB drives, the ones we have are rated at 6Gbps. Upon inspection, the drives in an-presto10... [18:25:12] i'm noticing JobQueueError rates jumped up around 15:45 UTC. [18:26:06] (03PS1) 10Eevans: cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) [18:26:08] (03PS1) 10Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) [18:26:11] (03PS1) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [18:32:52] (03CR) 10Eevans: "This changeset is just a copy of aqs-http-gateway, with `s/aqs-http-gateway/cassandra-http-gateway/g` applied. It's here to make the subs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [18:36:57] (03CR) 10Eevans: "This alters the config template to include the `tables` stanza, when tables has been defined in `values.yaml`. 
I was considering whether " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [18:37:57] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2332.codfw.wmnet [18:39:07] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2332.codfw.wmnet [18:42:13] !log 1.46.0-wmf.19 train status: no current blockers, going ahead to group1. [18:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:39] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250655 (https://phabricator.wikimedia.org/T413810) [18:42:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250655 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [18:43:03] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [18:43:33] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250655 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [18:44:14] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:44:25] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:45:24] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:45:27] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:49:10] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:49:23] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE 
helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:49:27] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.19 refs T413810 [18:49:31] T413810: 1.46.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T413810 [18:52:04] (03CR) 10Ottomata: [C:03+1] topic: mw-page-edit-type-enrich-next [puppet] - 10https://gerrit.wikimedia.org/r/1249957 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:52:48] (03CR) 10Ssingh: "Yeah this is a good idea. I will review shortly." [puppet] - 10https://gerrit.wikimedia.org/r/1250626 (owner: 10Majavah) [18:54:52] (03PS1) 10Kgraessle: Add multilingual revert risk host header for LiftWing requests [extensions/AutoModerator] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250656 (https://phabricator.wikimedia.org/T419718) [18:55:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11699342 (10Jclark-ctr) additionally Sas Vs Sata drives. [18:55:18] (03CR) 10Ottomata: "One comment, but LGTM in general." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:55:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/AutoModerator] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250656 (https://phabricator.wikimedia.org/T419718) (owner: 10Kgraessle) [18:56:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [19:01:29] (03CR) 10AKhatun: stream: mediawiki.page_edit_type_simple (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [19:01:36] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet [19:01:48] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp7011.magru.wmnet [19:10:43] (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway rate limiting: add CORS support (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [19:28:25] (03PS1) 10Arlolra: Show category index when no category selected on Special:LintTemplateErrors [extensions/Linter] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250665 (https://phabricator.wikimedia.org/T417363) [19:28:51] (03PS1) 10Arlolra: Show category index when no category selected on Special:LintTemplateErrors [extensions/Linter] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250666 (https://phabricator.wikimedia.org/T417363) [19:29:13] (03PS1) 10Andrew Bogott: Remove mention of cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1250667 (https://phabricator.wikimedia.org/T419738) [19:32:00] (03CR) 10Andrew Bogott: 
[C:03+2] Remove mention of cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1250667 (https://phabricator.wikimedia.org/T419738) (owner: 10Andrew Bogott) [19:34:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Linter] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250665 (https://phabricator.wikimedia.org/T417363) (owner: 10Arlolra) [19:34:25] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11699470 (10MoritzMuehlenhoff) [19:34:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Linter] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250666 (https://phabricator.wikimedia.org/T417363) (owner: 10Arlolra) [19:35:08] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11699472 (10phaultfinder) [19:37:11] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-backup1004.eqiad.wmnet with OS trixie [19:37:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11699477 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-backup1004.eqiad.wmnet with OS trixie executed with errors: - ms-... 
[19:40:43] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11699489 (10MoritzMuehlenhoff) [19:47:28] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699517 (10BCornwall) Some commands I ran, e.g. for esams: `lang=bash $ sudo -i cumin 'ganeti3* or lvs3* or cp3* or dns3*' 'grep... [19:50:10] (03PS1) 10Ebernhardson: semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 [19:51:10] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [19:52:08] (03CR) 10CI reject: [V:04-1] semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 (owner: 10Ebernhardson) [19:54:54] !log andrew@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:57:01] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699535 (10BCornwall) [19:58:56] (03PS2) 10Ebernhardson: semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 [19:59:37] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7010.magru.wmnet with OS trixie [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T2000). [20:00:05] JSherman, sfaci, bvibber, anzx, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] o/ [20:00:12] o/ [20:00:18] o/ [20:01:59] o/ [20:02:08] o/ [20:02:47] (03PS2) 10Thcipriani: Blubber: bump blubber to 1.8.1; set setuptools version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1250619 (https://phabricator.wikimedia.org/T418253) [20:03:02] sfaci is around but I will deploy his patch [20:03:16] Thanks Clare! [20:03:18] I can self deploy [20:03:33] I'm willing to deploy for others if needed [20:03:35] i can self-deploy mine (happy to go last if people are in a hurry too) [20:03:46] (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway rate limiting: add CORS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [20:04:10] are there any no-image-rebuild-triggering config changes that might be quicker than the rest? [20:04:34] o/ [20:04:41] no localization changes in mine [20:05:39] bvibber: how about you go first then? [20:05:44] ok! [20:06:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250647 (https://phabricator.wikimedia.org/T416063) (owner: 10Bvibber) [20:06:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250648 (https://phabricator.wikimedia.org/T416063) (owner: 10Bvibber) [20:08:06] I'm happy to include other patches in my deploy (or have mine included in another deploy) if anyone feels theirs is low risk of revert [20:08:55] (03PS3) 10Ebernhardson: semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 [20:08:57] mine/ours should be low risk - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1250632 [20:09:37] (03Merged) 10jenkins-bot: Revert "Fix for temp section open during slow loads on Parsoid" 
[extensions/MobileFrontend] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250647 (https://phabricator.wikimedia.org/T416063) (owner: 10Bvibber) [20:09:42] (03Merged) 10jenkins-bot: Revert "Fix for temp section open during slow loads on Parsoid" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250648 (https://phabricator.wikimedia.org/T416063) (owner: 10Bvibber) [20:10:08] cjming: agree! [20:10:19] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1250647|Revert "Fix for temp section open during slow loads on Parsoid" (T416063 T419170 T419721)]], [[gerrit:1250648|Revert "Fix for temp section open during slow loads on Parsoid" (T416063 T419170 T419721)]] [20:10:26] T416063: Section collapsing in Parsoid version not resilient in case of slow connections - https://phabricator.wikimedia.org/T416063 [20:10:26] T419170: Talk pages on mobile with Parsoid are unusable when there are level 1 headers - https://phabricator.wikimedia.org/T419170 [20:10:27] T419721: Various client errors relating to MobileFrontend section collapsing - https://phabricator.wikimedia.org/T419721 [20:10:58] 😎 20:10:31 0 languages rebuilt out of 545 [20:11:46] i18n cache invalidation counts are like golf scores; lower == better! [20:11:50] if that new localization cache mode ever pans out in production, it's gonna be great -- will take like 30s to do a full rebuild instead of ages [20:12:29] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1250647|Revert "Fix for temp section open during slow loads on Parsoid" (T416063 T419170 T419721)]], [[gerrit:1250648|Revert "Fix for temp section open during slow loads on Parsoid" (T416063 T419170 T419721)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:13:07] !log bvibber@deploy2002 bvibber: Continuing with sync [20:13:11] lookin' good [20:13:20] cjming: do you have a preference over who deploys?
I do not have feelings about it. [20:15:10] no! happy to do whatever is needed [20:15:44] no feelings either way here either [20:16:30] okey dokey, I'll go ahead and deploy when it's time [20:16:37] ty! [20:17:06] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250647|Revert "Fix for temp section open during slow loads on Parsoid" (T416063 T419170 T419721)]], [[gerrit:1250648|Revert "Fix for temp section open during slow loads on Parsoid" (T416063 T419170 T419721)]] (duration: 06m 47s) [20:17:13] T416063: Section collapsing in Parsoid version not resilient in case of slow connections - https://phabricator.wikimedia.org/T416063 [20:17:13] T419170: Talk pages on mobile with Parsoid are unusable when there are level 1 headers - https://phabricator.wikimedia.org/T419170 [20:17:13] T419721: Various client errors relating to MobileFrontend section collapsing - https://phabricator.wikimedia.org/T419721 [20:18:05] (03PS4) 10Ebernhardson: semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 [20:18:33] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699669 (10RobH) Thanks everyone for the feedback, I'll fill out the templates and submit them over! 
[20:18:35] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [20:19:58] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:21:18] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:21:37] (03CR) 10Ebernhardson: [C:03+2] semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 (owner: 10Ebernhardson) [20:22:21] (03PS1) 10Daniel Kinzler: rest gateway: add second Lua filter for header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) [20:22:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250581 (https://phabricator.wikimedia.org/T419442) (owner: 10Jsn.sherman) [20:22:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250582 (https://phabricator.wikimedia.org/T419517) (owner: 10Jsn.sherman) [20:22:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250632 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [20:23:32] (03Merged) 10jenkins-bot: semantic-test: Test adding a non-master node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250672 (owner: 10Ebernhardson) [20:24:57] (03PS1) 10Daniel Kinzler: rest-gateway: enable Mar2026 policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250676 (https://phabricator.wikimedia.org/T417778) [20:25:15] (03CR) 10CDobbins: [V:03+1 C:03+2] prometheus: fix pooled host check (again) [puppet] - 
10https://gerrit.wikimedia.org/r/1249407 (owner: 10CDobbins) [20:26:50] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699721 (10ssingh) Thanks for completing both drmrs and esams in this round, @BCornwall. Nice work! [20:27:10] (03Merged) 10jenkins-bot: riskyArticleEdits: show page descriptions [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250581 (https://phabricator.wikimedia.org/T419442) (owner: 10Jsn.sherman) [20:27:11] (03Merged) 10jenkins-bot: Fix Instrumentation on mobile view [extensions/PersonalDashboard] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250582 (https://phabricator.wikimedia.org/T419517) (owner: 10Jsn.sherman) [20:27:13] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Updated Test Kitchen impact test experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250632 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [20:27:15] (03PS1) 10Ebernhardson: opensearch-semantic-search-test: Reduce resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250678 [20:27:48] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1250581|riskyArticleEdits: show page descriptions (T419442)]], [[gerrit:1250582|Fix Instrumentation on mobile view (T419517)]], [[gerrit:1250632|ext.wikimediaEvents: Updated Test Kitchen impact test experiment (T407570)]] [20:27:55] T419442: Short descriptions aren't being displayed in PersonalDashboard edit cards - https://phabricator.wikimedia.org/T419442 [20:27:55] T419517: Fix Instrumentation on mobile view - https://phabricator.wikimedia.org/T419517 [20:27:56] T407570: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570 [20:28:11] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2 days, 0:00:00 on gitlab2002.wikimedia.org with reason: Upgrade [20:28:16] (03PS1) 10Daniel Kinzler: Revert: rest-gateway: enable Mar2026 policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250679 (https://phabricator.wikimedia.org/T417778) [20:28:38] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on gitlab1003.wikimedia.org with reason: Upgrade [20:29:48] (03CR) 10Ebernhardson: [C:03+2] opensearch-semantic-search-test: Reduce resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250678 (owner: 10Ebernhardson) [20:29:58] !log jsn@deploy2002 jsn, sfaci: Backport for [[gerrit:1250581|riskyArticleEdits: show page descriptions (T419442)]], [[gerrit:1250582|Fix Instrumentation on mobile view (T419517)]], [[gerrit:1250632|ext.wikimediaEvents: Updated Test Kitchen impact test experiment (T407570)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:30:18] cjming: and/or sfaci: please test [20:31:38] (03Merged) 10jenkins-bot: opensearch-semantic-search-test: Reduce resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250678 (owner: 10Ebernhardson) [20:32:15] ack [20:32:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:25] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [20:33:01] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [20:34:11] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [20:34:20] JSherman: Tested! It's ok! 
[20:34:26] !log jsn@deploy2002 jsn, sfaci: Continuing with sync [20:34:31] Thanks! [20:34:42] sfaci: no problem! [20:36:00] anzx: & arlolra: do you need someone to deploy for you, or will you self-deploy? [20:36:14] I can self deploy but should go last [20:36:22] JSherman: need someone to deploy for me [20:37:00] anzx: ok, I'll deploy your patch once this one finishes [20:37:05] ok [20:37:18] arlolra: I'll ping you for handoff [20:37:24] thanks [20:37:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [20:37:57] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ml-serve1014.eqiad.wmnet with reason: T400626 [20:38:02] T400626: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626 [20:38:25] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250581|riskyArticleEdits: show page descriptions (T419442)]], [[gerrit:1250582|Fix Instrumentation on mobile view (T419517)]], [[gerrit:1250632|ext.wikimediaEvents: Updated Test Kitchen impact test experiment (T407570)]] (duration: 10m 37s) [20:38:31] T419442: Short descriptions aren't being displayed in PersonalDashboard edit cards - https://phabricator.wikimedia.org/T419442 [20:38:31] T419517: Fix Instrumentation on mobile view - https://phabricator.wikimedia.org/T419517 [20:38:32] T407570: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570 [20:38:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250579 (https://phabricator.wikimedia.org/T415974) (owner: 10Anzx) [20:39:44] (03Merged) 10jenkins-bot: urwikisource: add logo, sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250579 (https://phabricator.wikimedia.org/T415974) (owner: 10Anzx) [20:40:12] 
!log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1250579|urwikisource: add logo, sitename and projectnamespace (T415974)]] [20:40:17] T415974: Post-creation work for urwikisource - https://phabricator.wikimedia.org/T415974 [20:42:21] !log jsn@deploy2002 anzx, jsn: Backport for [[gerrit:1250579|urwikisource: add logo, sitename and projectnamespace (T415974)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:42:41] anzx: please test [20:43:07] JSherman: looks good, ok to sync [20:43:11] !log jsn@deploy2002 anzx, jsn: Continuing with sync [20:43:17] okey dokey [20:47:07] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250579|urwikisource: add logo, sitename and projectnamespace (T415974)]] (duration: 06m 55s) [20:47:11] T415974: Post-creation work for urwikisource - https://phabricator.wikimedia.org/T415974 [20:47:16] anzx: all done! arlolra: all yours! [20:47:33] JSherman: thanks for deploying [20:47:39] no prob! 
[20:48:46] ok [20:49:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [extensions/Linter] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250665 (https://phabricator.wikimedia.org/T417363) (owner: 10Arlolra) [20:49:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [extensions/Linter] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250666 (https://phabricator.wikimedia.org/T417363) (owner: 10Arlolra) [20:53:43] (03Merged) 10jenkins-bot: Show category index when no category selected on Special:LintTemplateErrors [extensions/Linter] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250665 (https://phabricator.wikimedia.org/T417363) (owner: 10Arlolra) [20:53:45] (03Merged) 10jenkins-bot: Show category index when no category selected on Special:LintTemplateErrors [extensions/Linter] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250666 (https://phabricator.wikimedia.org/T417363) (owner: 10Arlolra) [20:54:17] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1250665|Show category index when no category selected on Special:LintTemplateErrors (T417363)]], [[gerrit:1250666|Show category index when no category selected on Special:LintTemplateErrors (T417363)]] [20:54:21] T417363: Special:LintTemplateErrors with no category selected should display an index - https://phabricator.wikimedia.org/T417363 [20:55:39] (03PS1) 10Bking: wdqs: allow NFS mount from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/1250683 (https://phabricator.wikimedia.org/T415073) [20:56:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1250683 (https://phabricator.wikimedia.org/T415073) (owner: 10Bking) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T2100) [21:00:45] arlolra: Do you know how long you'll be? 
[21:01:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:02:50] (03PS2) 10Bking: wdqs: allow NFS mount from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/1250683 (https://phabricator.wikimedia.org/T415073) [21:02:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1250683 (https://phabricator.wikimedia.org/T415073) (owner: 10Bking) [21:04:15] (03PS1) 10Ebernhardson: opensearch-semantic-search: Increse memory quota to 650G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250686 (https://phabricator.wikimedia.org/T414091) [21:04:17] (03PS1) 10Ebernhardson: opensearch-semantic-search: Scale for additional wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250687 (https://phabricator.wikimedia.org/T414091) [21:04:55] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11699919 (10RobH) > To:CustomerCare@digitalrealty.com > Wikimedia metrics for Energy Efficiency Directive (EED) - AMS17 and MRS02 >... [21:04:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7010.magru.wmnet with OS trixie [21:05:59] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11699921 (10RobH) Ok, that is annoying, these auto created tasks cannot have things appended into the task descirption or phaultfinder removes it... === Troublesho... [21:06:00] (03CR) 10Ebernhardson: [C:04-2] "Not currently deployable, hosts limit pods to 32gb currently. 
Waiting for confirmation from dpe-sre if they would prefer more pods, or cha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250687 (https://phabricator.wikimedia.org/T414091) (owner: 10Ebernhardson) [21:06:49] arlolra: We really do want to use our deployment window… [21:07:57] James_F: sorry, the images are building [21:08:04] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7010.* [21:08:09] it's been running for 18 mins [21:08:11] Yeah, because you're backporting an i18n change. [21:08:17] It'll take 40 minutes for that. [21:08:19] do you want me to stop [21:08:27] Well, I imagine you're back-porting for a reason. [21:08:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7009.magru.wmnet with OS trixie [21:08:55] But there's a reason that there's a limit on how many patches can go in each window. :-( [21:09:23] with any i18n the window is limited to one patch? [21:09:30] Effectively. [21:09:35] Aka "wait for the train". [21:10:17] the reason i'm backporting is to avoid waiting another week to send an announcement on tech news [21:10:25] i can stop [21:10:38] Eh. It'll be done soon enough. [21:10:40] although it's sync'ing now [21:10:42] The images are fully built now. [21:10:43] Yeah. [21:11:51] again, sorry. I should have asked knowing that I was going to go over the window [21:15:33] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1250665|Show category index when no category selected on Special:LintTemplateErrors (T417363)]], [[gerrit:1250666|Show category index when no category selected on Special:LintTemplateErrors (T417363)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:15:38] T417363: Special:LintTemplateErrors with no category selected should display an index - https://phabricator.wikimedia.org/T417363 [21:16:04] (03CR) 10JHathaway: [C:03+1] Remove mpic redirects to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1250250 (https://phabricator.wikimedia.org/T415845) (owner: 10Clare Ming) [21:16:12] !log arlolra@deploy2002 arlolra: Continuing with sync [21:22:56] (03CR) 10JHathaway: [C:03+2] Remove mpic redirects to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1250250 (https://phabricator.wikimedia.org/T415845) (owner: 10Clare Ming) [21:24:04] (03PS1) 10RLazarus: Update to v1.35.9 [debs/envoyproxy] (v1.35) - 10https://gerrit.wikimedia.org/r/1250688 (https://phabricator.wikimedia.org/T419637) [21:24:41] (03CR) 10Jforrester: OrchestratorRequest: Switch evaluations to v2 endpoint [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250051 (https://phabricator.wikimedia.org/T413727) (owner: 10Jforrester) [21:25:42] (03CR) 10RLazarus: [C:03+2] Update to v1.35.9 [debs/envoyproxy] (v1.35) - 10https://gerrit.wikimedia.org/r/1250688 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [21:29:32] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250665|Show category index when no category selected on Special:LintTemplateErrors (T417363)]], [[gerrit:1250666|Show category index when no category selected on Special:LintTemplateErrors (T417363)]] (duration: 35m 16s) [21:29:36] T417363: Special:LintTemplateErrors with no category selected should display an index - https://phabricator.wikimedia.org/T417363 [21:29:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250051 (https://phabricator.wikimedia.org/T413727) (owner: 10Jforrester) [21:29:58] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.6 release to staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1250692 (https://phabricator.wikimedia.org/T408233) [21:30:08] !log rzl@apt1002:~$ sudo -i reprepro -C component/envoy-future include bullseye-wikimedia /home/rzl/envoyproxy_1.35.9-1_amd64.changes [21:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:42] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250693 (https://phabricator.wikimedia.org/T408233) [21:34:27] (03CR) 10Bartosz Dziewoński: rest-gateway rate limiting: add CORS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [21:35:22] (03Merged) 10jenkins-bot: OrchestratorRequest: Switch evaluations to v2 endpoint [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250051 (https://phabricator.wikimedia.org/T413727) (owner: 10Jforrester) [21:35:56] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1250051|OrchestratorRequest: Switch evaluations to v2 endpoint (T413727)]] [21:36:00] T413727: Validate and Productionize /v2/evaluate - https://phabricator.wikimedia.org/T413727 [21:36:26] (03PS1) 10RLazarus: envoy-future: Update to v1.35.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250697 (https://phabricator.wikimedia.org/T419637) [21:37:54] (03CR) 10RLazarus: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250697 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [21:38:16] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.2.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250692 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [21:39:42] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7009.magru.wmnet with reason: host reimage [21:40:08] !log 
jforrester@deploy2002 jforrester: Backport for [[gerrit:1250051|OrchestratorRequest: Switch evaluations to v2 endpoint (T413727)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:40:19] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250692 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [21:41:52] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700050 (10RobH) I had some ISP issues with upload speeds to magru, so Papaul helped me out and flashed the firmware for idrac, bios, and backplane. The error persists, so I'... [21:42:49] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:43:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7009.magru.wmnet with reason: host reimage [21:43:32] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:47:48] !log jforrester@deploy2002 jforrester: Continuing with sync [21:49:01] (03PS4) 10Jdlrobson: Enable parser survey for opted out users on all parsoid rendered wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) [21:49:42] (03CR) 10Btullis: [C:03+1] wdqs: allow NFS mount from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/1250683 (https://phabricator.wikimedia.org/T415073) (owner: 10Bking) [21:54:15] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250051|OrchestratorRequest: Switch evaluations to v2 endpoint (T413727)]] (duration: 18m 19s) [21:54:19] T413727: Validate and Productionize /v2/evaluate - https://phabricator.wikimedia.org/T413727 [21:55:27] !log jforrester@deploy2002 helmfile [eqiad] START 
helmfile.d/services/wikifunctions: sync [21:56:08] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [21:57:51] Done at my end. [21:58:14] (03CR) 10Scott French: [C:03+1] envoy-future: Update to v1.35.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250697 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T2200) [22:01:39] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy-future: Update to v1.35.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250697 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [22:04:58] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:14:58] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:25:31] jouncebot: nowandnext [22:25:31] For the next 0 hour(s) and 34 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260311T2200) [22:25:31] In 7 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T0600) [22:25:31] In 7 hour(s) and 34 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260312T0600) [22:27:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7009.magru.wmnet with OS 
trixie [22:27:42] borrowing mw-debug for an envoy update [22:29:14] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:29:38] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:44:48] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:45:04] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:45:11] done [22:46:21] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11700239 (10VRiley-WMF) @Dzahn Let me know if you're about to access this and if so, I will close it out. Thanks! [22:46:24] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.2.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250693 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [22:47:55] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700241 (10RobH) a:05RobH→03BCornwall @BCornwall, After firmware updates and resetting the SEL and rebooting the issue now seems to have cleared up. The collection log... 
[22:48:02] (03PS1) 10RLazarus: mathoid: Upgrade to envoy-future:1.35.9 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250728 (https://phabricator.wikimedia.org/T419637) [22:48:28] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700244 (10RobH) [22:48:28] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250693 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [22:48:39] 10ops-magru, 06DC-Ops, 06Traffic: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700247 (10RobH) [22:51:53] (03PS1) 10RLazarus: {api,rest}-gateway: Update staging to Envoy 1.35.9 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250731 (https://phabricator.wikimedia.org/T419637) [22:52:30] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [22:52:47] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [22:56:07] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11700259 (10Dzahn) @VRiley-WMF Thank you very much! 
It works and I can connect :)) [22:56:28] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11700260 (10Dzahn) 05Open→03Resolved [23:04:06] (03PS1) 10Dzahn: site/jenkins: apply jenkins role on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1250743 (https://phabricator.wikimedia.org/T418521) [23:06:11] (03PS2) 10Dzahn: site/jenkins: apply jenkins role on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1250743 (https://phabricator.wikimedia.org/T418521) [23:07:24] (03CR) 10Dzahn: [C:03+2] site/jenkins: apply jenkins role on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1250743 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:15:20] (03CR) 10Dzahn: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1250571 (owner: 10Jelto) [23:21:34] PROBLEM - Check if dnsdist.service has been restarted after /etc/dnsdist/dnsdist.conf was changed on doh1002 is CRITICAL: CRITICAL: Service dnsdist.service has not been restarted after /etc/dnsdist/dnsdist.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [23:30:22] RECOVERY - Check if dnsdist.service has been restarted after /etc/dnsdist/dnsdist.conf was changed on doh1002 is OK: OK: dnsdist.service was restarted after /etc/dnsdist/dnsdist.conf was changed. 
https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [23:33:00] (03PS1) 10Dzahn: jenkins: add proxy_jenkins profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) [23:38:43] (03CR) 10Dzahn: [V:04-1] "did not find a value for the name 'profile::ci::proxy_jenkins::http_port'" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:39:50] (03PS2) 10Dzahn: jenkins: add proxy_jenkins profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) [23:40:46] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1250748/8256/contint2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:41:46] (03CR) 10Dzahn: [V:03+1] "This drops the proxy config in /etc/apache2/ but we don't have apache installed yet. WIP" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:42:30] (03PS3) 10Dzahn: ci::website: support 2 different websites, integration vs zuul-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) [23:47:41] (03PS1) 10C. Scott Ananian: Enable parser survey for opted-out users on ru/pt/ja/id wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250750 (https://phabricator.wikimedia.org/T414852) [23:56:13] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7009.* [23:58:05] (03CR) 10Dzahn: [V:03+1 C:03+2] "This allows us to host 2 different websites or apache configs on "legacy CI" vs "jenkins". 
noop on existing contint: https://puppet-comp" [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:59:03] (03PS3) 10Dzahn: ci::website/ci::httpd: move monitoring to website, not httpd [puppet] - 10https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521)