[00:06:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:54] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:16:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:54] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:11:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1264140 [01:11:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1264140 (owner: 10TrainBranchBot) [01:21:07] (03CR) 10Zabe: [C:03+2] "retry" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1264118 (owner: 10TrainBranchBot) [01:25:20] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1264140 (owner: 10TrainBranchBot) [01:32:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1264118 (owner: 10TrainBranchBot) [02:00:55] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:01:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:07:45] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 50s) [02:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:38] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:11:33] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:55:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:01:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:02:32] PROBLEM - Check unit status of 
httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:37:06] (03PS1) 10KartikMistry: Update cxserver to 2026-03-25-072715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264221 [05:01:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:02:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:03:02] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 68810840 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:04:02] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3484120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:14:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1022.eqiad.wmnet with reason: Downgrade clouddb1022 to 10.11.13 [05:16:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1022.eqiad.wmnet with reason: Downgrade clouddb1022 to 10.11.13 [05:16:17] RESOLVED: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:16:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn 
[05:53:52] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264257 [05:56:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [06:03:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:06:11] (03PS1) 10Tiziano Fogli: prometheus4002: clean up unused Hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/1264265 (https://phabricator.wikimedia.org/T419430) [06:06:13] (03PS1) 10Tiziano Fogli: Switch prometheus3004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264266 (https://phabricator.wikimedia.org/T419960) [06:06:13] (03PS1) 10Tiziano Fogli: Switch prometheus5002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264267 (https://phabricator.wikimedia.org/T419960) [06:06:14] (03PS1) 10Tiziano Fogli: Switch prometheus6002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264268 (https://phabricator.wikimedia.org/T419960) [06:06:15] (03PS1) 10Tiziano Fogli: Switch prometheus7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264269 (https://phabricator.wikimedia.org/T419960) [06:07:56] (03PS2) 10Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) [06:08:37] (03PS3) 10Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) [06:09:02] (03CR) 10Arnaudb: "sounds good to me, updated!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [06:09:25] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [06:13:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [06:15:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:44:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1264268 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [06:45:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1264267 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [06:45:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1264266 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [06:46:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1264269 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [06:47:33] (03CR) 10Fabfur: [C:03+2] aptrepo: updates configuration for haproxy32 [puppet] - 10https://gerrit.wikimedia.org/r/1262146 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [06:47:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1264265 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [06:53:28] (03PS1) 10Muehlenhoff: Fix Cumin alias for ganeti-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1264296 (https://phabricator.wikimedia.org/T418993) [06:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:57:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:58:11] (03CR) 10Tiziano Fogli: [C:03+2] prometheus4002: clean up unused Hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/1264265 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [06:59:32] (03CR) 10Tiziano Fogli: [C:03+2] Switch prometheus3004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264266 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:03:41] hashar: if I got a last-minute patch up for T421458, would you be willing to review/deploy & run a throttle-resetting maintenance script? 
(pinging you as you mentioned previously in -operations to ping you if there's nobody around for deployments at this time of day, i hope that's okay :) ) [07:03:43] T421458: Lift IP cap on 2026-03-30 for Students Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T421458 [07:04:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:04:33] !log prometheus3004: switch to nftables and reboot (T419960) [07:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie [07:05:35] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762861 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4006.wikimedia.org w... 
[07:08:06] !log prometheus4003: reboot (T419960) [07:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:42] (03CR) 10Tiziano Fogli: [C:03+2] Switch prometheus5002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264267 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [07:11:00] !log prometheus5002: switch to nftables and reboot (T419960) [07:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:23] (03CR) 10Tiziano Fogli: [C:03+2] Switch prometheus6002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264268 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [07:18:10] !log prometheus6002: switch to nftables and reboot (T419960) [07:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:44] (03Abandoned) 10Arnaudb: Revert "gerrit: align ATS/Envoy/Apache timeouts" [puppet] - 10https://gerrit.wikimedia.org/r/1261961 (owner: 10Arnaudb) [07:23:45] quit [07:24:37] (03CR) 10Tiziano Fogli: [C:03+2] Switch prometheus7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1264269 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [07:24:56] !log prometheus7002: switch to nftables and reboot (T419960) [07:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:45] (03CR) 10JavierMonton: [V:03+1] eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) (owner: 10Ottomata) [07:36:01] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1260135 (owner: 10Andrew Bogott) [07:38:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie [07:38:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - 
https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [07:38:28] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762891 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast4006.wikimedia.org with... [07:39:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:39:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:39:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261377 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [07:39:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [07:41:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:41:39] (03Merged) 10jenkins-bot: stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261377 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [07:42:31] !log javiermonton@deploy1003 Started scap sync-world: Backport for [[gerrit:1261377|stream: mediawiki.page_html_content_change (T421341)]] [07:42:39] T421341: Update HTML pipeline schema - rendering_content_change - https://phabricator.wikimedia.org/T421341 [07:44:23] 
RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [07:45:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie [07:45:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie [07:45:45] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4006.wikimedia.org w... [07:45:47] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762901 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast4006.wikimedia.org with... 
[07:46:17] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:48:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie [07:48:32] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host bast4006.wikimedia.org with OS trixie [07:48:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie [07:49:07] (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: JavierMonton) [07:49:12] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762903 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4006.wikimedia.org w... 
[07:49:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:49:23] (03CR) 10JMeybohm: [C:03+2] k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm) [07:50:35] (03CR) 10Ayounsi: [C:03+1] Nokia: BGP policy for unicast bgp sw_external outside peerings [homer/public] - 10https://gerrit.wikimedia.org/r/1262197 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [07:50:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [07:51:31] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [07:51:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:54:04] (03Merged) 10jenkins-bot: k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm) [07:54:42] !log deploy rabbitmq changes to allow cli communication - T420923 [07:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:47] T420923: rabbitmqctl list_queues in eqiad/codfw times out after 60s - https://phabricator.wikimedia.org/T420923 [07:54:58] (03CR) 10Filippo Giunchedi: [C:03+2] rabbitmq: enable cli tools peers communication [puppet] - 10https://gerrit.wikimedia.org/r/1261366 (https://phabricator.wikimedia.org/T420923) (owner: 10Filippo Giunchedi) [07:55:19] (03CR) 10Ayounsi: [C:03+1] "I didn't know 
things were broken. +1 to manually fix it and +1 to that change." [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: Cathal Mooney) [07:55:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:55:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [07:56:32] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:00:23] !log javiermonton@deploy1003 javiermonton: Backport for [[gerrit:1261377|stream: mediawiki.page_html_content_change (T421341)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:00:30] T421341: Update HTML pipeline schema - rendering_content_change - https://phabricator.wikimedia.org/T421341 [08:00:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [08:02:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:02:18] (03PS1) 10Filippo Giunchedi: rabbitmq: fix firewall port range for cli tools [puppet] - 10https://gerrit.wikimedia.org/r/1264347 (https://phabricator.wikimedia.org/T420923) [08:03:09] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] rabbitmq: fix firewall port range for cli tools [puppet] - 10https://gerrit.wikimedia.org/r/1264347 (https://phabricator.wikimedia.org/T420923) (owner: 10Filippo Giunchedi) [08:03:22] (03CR) 10Ayounsi: [C:03+1] Fix Cumin alias for ganeti-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1264296 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [08:03:49] !log javiermonton@deploy1003 javiermonton: Continuing with sync [08:14:05] (03PS1) 10Filippo Giunchedi: rabbitmq: make server erlang distribution listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1264355 (https://phabricator.wikimedia.org/T420923) [08:14:33] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] rabbitmq: make server erlang distribution listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1264355 (https://phabricator.wikimedia.org/T420923) (owner: 10Filippo Giunchedi) [08:14:48] (03CR) 10JMeybohm: [C:03+2] trafficserver: 100% of /feed/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1260690 (https://phabricator.wikimedia.org/T421233) (owner: 10Clément Goubert) [08:15:04] (03CR) 10Muehlenhoff: [C:03+2] Fix Cumin alias for ganeti-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1264296 (https://phabricator.wikimedia.org/T418993) (owner: 
10Muehlenhoff) [08:17:42] !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1261377|stream: mediawiki.page_html_content_change (T421341)]] (duration: 35m 10s) [08:17:50] T421341: Update HTML pipeline schema - rendering_content_change - https://phabricator.wikimedia.org/T421341 [08:18:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:25:17] (03PS1) 10Filippo Giunchedi: rabbitmq: use correct erlang distribution ports on firewall [puppet] - 10https://gerrit.wikimedia.org/r/1264360 (https://phabricator.wikimedia.org/T420923) [08:25:37] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] rabbitmq: use correct erlang distribution ports on firewall [puppet] - 10https://gerrit.wikimedia.org/r/1264360 (https://phabricator.wikimedia.org/T420923) (owner: 10Filippo Giunchedi) [08:26:29] (03PS3) 10Tiziano Fogli: prometheus/pop: consolidate the firewall provider declaration at the role level. [puppet] - 10https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960) [08:29:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [08:32:01] (03PS1) 10Ilias Sarantopoulos: ml-services: update doc links in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264365 (https://phabricator.wikimedia.org/T406369) [08:34:18] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update doc links in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264365 (https://phabricator.wikimedia.org/T406369) (owner: 10Ilias Sarantopoulos) [08:34:19] 06SRE, 06ServiceOps new, 07Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11762999 (10MLechvien-WMF) Hi, The correlation with DC Switchover does not seem obvious, for example it seems there was another... [08:34:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie [08:35:12] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:35:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:37:18] !log prometheus[12]005: reboot (T419960) [08:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:56] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update doc links in 
ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264365 (https://phabricator.wikimedia.org/T406369) (owner: 10Ilias Sarantopoulos) [08:38:12] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:38:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:38:13] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:38:19] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:39:24] PROBLEM - Host prometheus2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:39:46] (03PS1) 10Ayounsi: Management routers: move standard security_zones to roles.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) [08:40:44] PROBLEM - Host prometheus1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:40:58] FIRING: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:03] (03PS1) 10MVernon: swift: drain 3 codfw nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1264368 (https://phabricator.wikimedia.org/T354872) [08:43:02] RECOVERY - Host prometheus2005 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [08:43:44] RECOVERY - Host prometheus1005 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [08:45:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, 
wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:45:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [08:45:58] FIRING: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:46:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:47:20] (CR) Elukey: profile::base::certificates: rename Puppet Internal CA's path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1262055 (owner: Elukey) [08:49:44] ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11763091 (ayounsi) p:Triage→Low a:ayounsi Ultimately Juniper, I'll take the task for now. 
[08:50:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.411s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:50:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [08:50:58] RESOLVED: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:51:25] !log prometheus[12]007: reboot (T419960) [08:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:52:37] !log push pfw policy - T421556 [08:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:18] PROBLEM - Host prometheus1007 is DOWN: PING CRITICAL - Packet loss = 100% [08:53:18] PROBLEM - Host prometheus2007 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.411s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:55:54] 
RECOVERY - Host prometheus1007 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [08:55:54] RECOVERY - Host prometheus2007 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [08:56:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie [09:00:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:00:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:00:55] (03CR) 10Elukey: [C:03+2] java: add java-21-security erb template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: 10Elukey) [09:02:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:03:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:05:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:07:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:10:51] !log prometheus[12]006: reboot (T419960) [09:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:46] !log prometheus[12]008: reboot (T419960) [09:11:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:02] PROBLEM - SSH on prometheus2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:18] PROBLEM - Host prometheus1006 is DOWN: PING CRITICAL - Packet loss = 100% [09:13:24] PROBLEM - Host prometheus1008 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:02] PROBLEM - SSH on prometheus2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:14:58] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 12200 [09:15:44] RECOVERY - Host prometheus1006 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [09:15:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12200 [09:15:54] RECOVERY - Host prometheus1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:15:58] PROBLEM - Host prometheus2008 is DOWN: PING CRITICAL - Packet loss = 100% [09:15:58] PROBLEM - Host prometheus2006 is DOWN: PING CRITICAL - Packet loss = 100% [09:16:46] RECOVERY - Host prometheus2008 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [09:16:46] RECOVERY - Host prometheus2006 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms [09:16:52] RECOVERY - SSH on prometheus2008 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:52] RECOVERY - SSH on prometheus2006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:17:15] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 42 [09:17:33] (03CR) 10Cathal Mooney: [C:03+1] "Nice, thats a lot cleaner, and good reason to standardise interface usage which I’d not really thought was too important :)" [homer/public] - 10https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) (owner: 10Ayounsi) 
[09:18:28] FIRING: [8x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:38] (03CR) 10Ayounsi: [C:03+2] Management routers: move standard security_zones to roles.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) (owner: 10Ayounsi) [09:18:43] FIRING: [8x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:45] 07sre-alert-triage, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11763162 (10BTullis) 05Open→03Resolved a:03BTullis I'm resolving this ticket, since it is historical. This is one of the cases where we wish... 
[09:19:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42 [09:20:49] (03Merged) 10jenkins-bot: Management routers: move standard security_zones to roles.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) (owner: 10Ayounsi) [09:22:21] (03CR) 10Volans: [C:03+1] "Cookbook wise LGTM, I'll leave it to your team for the thanos-specific bits :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [09:23:28] RESOLVED: [8x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:19] (03PS2) 10Filippo Giunchedi: openstack: enable rabbit transient quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) [09:26:19] (03PS1) 10Filippo Giunchedi: openstack: enable trove-guestagent rabbit transient quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/1264557 (https://phabricator.wikimedia.org/T421054) [09:29:54] (03CR) 10Filippo Giunchedi: "> Indeed testing in codfw first SGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) (owner: 10Filippo Giunchedi) [09:42:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:42:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie [09:45:28] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: enable rabbit transient quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) (owner: 10Filippo Giunchedi) [09:46:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS bookworm [09:47:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:55:46] 06SRE, 07SRE-Unowned, 06WMF-Legal, 07SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437#11763322 (10Bugreporter) 05Open→03Declined Close as declined since WMF plans to shut down all Wikinewses (https://meta.wiki... 
[09:55:46] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1260716 (owner: 10Majavah) [09:55:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:57:12] (03PS2) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) [09:57:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:00:03] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8361/console" [puppet] - 10https://gerrit.wikimedia.org/r/1260716 (owner: 10Majavah) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1000) [10:00:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:02:12] (03CR) 10Vgutierrez: [C:04-1] "`modules/haproxy/manifests/init.pp` needs to be updated to install `haproxy-awslc` instead of `haproxy` package" [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [10:04:17] (03CR) 10MVernon: [C:03+1] "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/1260716 (owner: 10Majavah) [10:05:05] 06SRE, 07SRE-Unowned, 07SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#11763372 (10Bugreporter) 05Open→03Declined Close as declined since WMF plans to shut down all Wikinewses 
(https://meta.wikimedia.org/w/index.php?title=Wikimedia_Foundation_Board_notic... [10:05:50] (03CR) 10Ladsgroup: [C:03+1] swift: drain 3 codfw nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1264368 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [10:06:20] (03CR) 10MVernon: [C:03+2] swift: drain 3 codfw nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1264368 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [10:08:51] 06SRE: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988#11763403 (10Bugreporter) 05Stalled→03Declined Close as declined since WMF plans to shut down all Wikinewses (https://meta.wikimedia.org/w/index.php?title=Wikimedia_Foundation_Board... [10:08:56] (03CR) 10JMeybohm: [C:03+2] trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [10:09:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4006.wikimedia.org with reason: host reimage [10:13:16] (03CR) 10Btullis: [C:03+2] Add an LDAP group to the list considered during offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1261481 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [10:14:56] (03PS6) 10Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) [10:14:59] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11763431 (10MatthewVernon) [10:15:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4006.wikimedia.org with reason: host reimage [10:18:41] (03PS19) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - 
10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) [10:19:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [10:21:36] (03CR) 10Ladsgroup: [C:03+1] Start reading from new file table in dewiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [10:23:26] (03PS1) 101F616EMO: zhwikinews: 20th anniversary logo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) [10:28:39] (03PS1) 101F616EMO: Revert "zhwikinews: 20th anniversary logo change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) [10:28:53] (03CR) 10Btullis: wdqs-queryhammer: Deployment fixes (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [10:30:04] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [10:33:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [10:36:11] 06SRE, 10MediaModeration, 06Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763501 (10mszwarc) [10:37:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast4006.wikimedia.org with OS bookworm [10:38:02] 06SRE, 10MediaModeration, 06Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - 
https://phabricator.wikimedia.org/T421688#11763503 (10mszwarc) [10:38:25] 06SRE, 10MediaModeration, 06Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763504 (10mszwarc) [10:40:35] (03CR) 10Btullis: [C:03+2] Route dse-k8s API blackbox checks to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [10:43:33] (03PS1) 10Filippo Giunchedi: openstack: set oslo lock path where missing [puppet] - 10https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054) [10:44:16] (03PS1) 10Kosta Harlan: hCaptcha: Add APCu cache layer to health checker [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) [10:44:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [10:45:02] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: set oslo lock path where missing [puppet] - 10https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054) (owner: 10Filippo Giunchedi) [10:45:15] (03PS2) 10Filippo Giunchedi: openstack: set oslo lock path where missing [puppet] - 10https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054) [10:45:24] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] openstack: set oslo lock path where missing [puppet] - 10https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054) (owner: 10Filippo Giunchedi) [10:45:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [10:48:24] (03CR) 10Btullis: [C:03+2] Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [10:53:43] (03PS6) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) [10:53:51] (03CR) 10JMeybohm: [C:03+2] trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [10:54:14] 06SRE, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11763577 (10BTullis) [10:56:27] (03Merged) 10jenkins-bot: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [11:04:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:05:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:05:35] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [11:05:38] (03PS1) 10Ladsgroup: Switch from InterwikiSortingPrepend to the ULS config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264581 [11:06:01] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[11:07:22] (03PS2) 10Majavah: cephadm::target: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260716 [11:07:56] (03CR) 10Majavah: [C:03+2] cephadm::target: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260716 (owner: 10Majavah) [11:11:25] (03CR) 10Ladsgroup: [C:03+1] "shall we do it?" [puppet] - 10https://gerrit.wikimedia.org/r/1242430 (owner: 10Muehlenhoff) [11:12:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:15:20] (03PS20) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) [11:15:57] (03CR) 10Vgutierrez: "produced metrics with PS20 python script: https://phabricator.wikimedia.org/P89966" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [11:15:59] (03PS2) 10Majavah: cloudnfs: Remove Huggle project config [puppet] - 10https://gerrit.wikimedia.org/r/1259079 [11:16:48] (03CR) 10Majavah: [C:03+2] cloudnfs: Remove Huggle project config [puppet] - 10https://gerrit.wikimedia.org/r/1259079 (owner: 10Majavah) [11:17:10] (03CR) 10Kamila Součková: Enable $wgTempCategoryCollations for s3 wikis. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [11:22:06] (03CR) 10Hashar: "I have replied on the other change, we can't just use `profile::docker::engine` there are a bunch of other profiles that are needed :-]" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [11:25:26] (03PS4) 10Hashar: ci: use docker.io package starting with Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) [11:26:58] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264583 [11:37:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:39:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [11:43:13] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [11:48:02] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [11:48:34] (03CR) 10Joal: [C:03+1] [EventStreamConfig] Add product_metrics.web_base.active_reader_baseline stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262303 (https://phabricator.wikimedia.org/T420621) (owner: 10TChin) [11:49:14] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [11:51:13] (03PS1) 10Btullis: Update dummy keytabs to match the active list in puppet 
[labs/private] - 10https://gerrit.wikimedia.org/r/1264589 (https://phabricator.wikimedia.org/T421241) [11:51:26] !log bounce neutron-l3-agent on cloundnet1005 - T421054 [11:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:32] T421054: Move all openstack rabbitmq queues to quorum - https://phabricator.wikimedia.org/T421054 [11:52:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [11:53:28] (03CR) 10Btullis: [V:03+2 C:03+2] Update dummy keytabs to match the active list in puppet [labs/private] - 10https://gerrit.wikimedia.org/r/1264589 (https://phabricator.wikimedia.org/T421241) (owner: 10Btullis) [11:54:13] (03CR) 10Cathal Mooney: [C:03+2] Add policy 'transport-in' to apply as import on transport circuits [homer/public] - 10https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: 10Cathal Mooney) [11:54:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [11:55:34] (03Merged) 10jenkins-bot: Add policy 'transport-in' to apply as import on transport circuits [homer/public] - 10https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: 10Cathal Mooney) [11:55:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:34] (03PS1) 10Michael Große: instrument(ReviseTone): record start of copyedit session [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) [11:56:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] 
(wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) (owner: 10Michael Große) [11:57:02] 06SRE, 10MediaModeration, 06Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763897 (10Ladsgroup) I think you're requesting the 330px standard size. Can you switch to 500px instead? That is the size that is bei... [11:58:37] 06SRE, 10MediaModeration, 06Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763914 (10Ladsgroup) note that thumbnails don't get replicated across swift clusters, so changes to runtime after the switchover is a... [12:01:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [12:01:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [12:03:52] !log apply transport-in policy to core router transport peerings to prefer local anycast routes [12:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:37] 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11763945 (10BTullis) a:03BTullis [12:06:22] 06SRE, 10MediaModeration, 06Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763978 (10Dreamy_Jazz) >>! In T421688#11763897, @Ladsgroup wrote: > I think you're requesting the 330px standard size. Can you switch... 
[12:07:28] (03PS15) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [12:07:36] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:09:11] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:14:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [12:14:16] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11764012 (10MLechvien-WMF) @Blake did you use that in recent switchover? We didn't account for capacity in Q4 s... 
[12:14:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1262199 (https://phabricator.wikimedia.org/T421475) (owner: 10Jforrester) [12:15:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [12:17:26] (03CR) 10Effie Mouzeli: [C:03+1] "My knowledge is limited here:)" [puppet] - 10https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [12:17:32] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11764020 (10Ladsgroup) We had another one right now: https://lists.wikimedia.org... [12:18:05] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11764022 (10Blake) @MLechvien-WMF This was not completed in time for the switchover. I'm in the middle of a sig... [12:21:22] 06SRE, 10MediaModeration, 06Product Safety and Integrity, 07Essential-Work: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11764031 (10Dreamy_Jazz) [12:22:15] 06SRE, 10MediaModeration, 06Product Safety and Integrity, 07Essential-Work: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11764035 (10Ladsgroup) >>! In T421688#11763978, @Dreamy_Jazz wrote: >>>! 
In T421688#11763897, @Ladsgroup wrote: >>... [12:23:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11764036 (10ABran-WMF) The new training flow keeps the existing VRTS export unchanged: `vrts.TicketExport2Mbox.pl` still produ... [12:26:09] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421643#11764053 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Moved lined to L1/L2 off L3 Sensor: Line, AA:L3, Current Value: 12.02 A (current) Thresholds: Hi... [12:26:45] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11764069 (10Jclark-ctr) a:03Jclark-ctr [12:27:21] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421527#11764071 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:29:37] (03PS3) 10EMcFarland: Instrumentation: Track clicks for user account menu experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) [12:30:02] (03CR) 10Vgutierrez: [C:03+2] prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [12:30:51] (03CR) 10CI reject: [V:04-1] Instrumentation: Track clicks for user account menu experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [12:31:10] (03PS3) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) [12:31:35] (03PS9) 10Kgraessle: Enable revert risk 
filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [12:31:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [12:33:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [12:34:57] (03CR) 10Michael Große: "recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [12:34:57] (03CR) 10Vgutierrez: [C:04-1] aptrepo,haproxy: add haproxy-awslc component/package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [12:34:58] !log failover Ganeti master in ulsfo to ganeti4008 [12:35:01] PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:26] (03PS1) 10EMcFarland: Display create account button in main menu when user is logged out. 
[skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) [12:41:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [12:43:42] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11764146 (10Jclark-ctr) Reseated the power supply, but the error returned. I will open a Supermicro ticket and provide an update. [12:44:00] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11764147 (10Ajuanca) What's the simplest cookbook I can run to check the changes? I have tried with `sre.maps.roll-restart-reboot` but I get missing `/etc/cumin/config.yaml` [12:51:01] (03PS4) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) [12:51:17] (03PS1) 10Muehlenhoff: Registe rairflow-fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/1264628 (https://phabricator.wikimedia.org/T421703) [12:52:28] (03PS1) 10Vgutierrez: prometheus::ipip_exporter: Fix timer interval [puppet] - 10https://gerrit.wikimedia.org/r/1264629 (https://phabricator.wikimedia.org/T419873) [12:52:34] (03CR) 10Cparle: [C:03+1] Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [12:52:46] (03CR) 10Elukey: "I like it, I added a few comment just to be sure!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1262196 (https://phabricator.wikimedia.org/T393053) (owner: 10JHathaway) [12:52:54] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704 (10ayounsi) 03NEW [12:53:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1264629 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [12:56:06] (03PS1) 10Anne Tomasevich: Add event stream for logged-in reader retention experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264630 (https://phabricator.wikimedia.org/T420490) [12:58:16] (03CR) 10Elukey: "I have an ignorant question - does it mean that we'll get the same burrow metrics with the same values but with different "instance" label" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [12:58:20] (03CR) 10Vgutierrez: [C:03+2] prometheus::ipip_exporter: Fix timer interval [puppet] - 10https://gerrit.wikimedia.org/r/1264629 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [12:58:24] (03CR) 10Elukey: [C:03+1] Switch our servers to use deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1256371 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [12:58:28] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11764305 (10Jclark-ctr) supermicro case #00107974 [12:59:22] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11764339 (10ayounsi) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1300). 
[13:00:05] kostajh, Raine, MichaelG_WMF, James_F, and eileen-m__: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] (03CR) 10Elukey: "looks good, what is the difference with using max()?" [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [13:00:15] * MichaelG_WMF is here 👋 [13:00:55] o/, I am omw back from a dr appointment, so I'd like to go last (and I can self deploy) [13:01:14] hi [13:01:41] Mind if I start? [13:02:24] kostajh please go ahead :) [13:02:49] ok [13:03:13] I will also need someone to deploy my change (I don't think there is a way to test it, it only adds server side instrumentation) [13:03:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [13:05:02] (03Merged) 10jenkins-bot: hCaptcha: Add APCu cache layer to health checker [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [13:05:12] !log disabling puppet on A:wikiube-worker-eqiad for T420436 [13:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:18] T420436: Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436 [13:05:21] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:27] T412947: Reduce cache miss noise in memcached due to hcaptcha health checks - https://phabricator.wikimedia.org/T412947 [13:05:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet [13:06:50] (03CR) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:06:51] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:06:53] I'm ready whenever [13:07:12] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:27] (03CR) 10JMeybohm: [C:03+2] wikikube: Switch to IPIP mode on workers [puppet] - 10https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [13:08:19] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11764400 (10ayounsi) [13:09:44] !log kharlan@deploy1003 kharlan: Continuing with sync [13:10:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [13:12:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet [13:12:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [13:14:38] (03PS1) 10Kgraessle: Set live configuration for Extension:PersonalDashboard on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) [13:15:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for 
host aux-k8s-etcd1003.eqiad.wmnet [13:15:40] My change is syncing out now. James_F are you able to help with syncing MichaelG_WMF's patch? [13:16:33] Sure. [13:16:52] And ideally also eileen-m__ patches later (they're still struggling with NickServ, but I can test them as well) [13:17:17] but if those get pushed to next time then that is also not the end of the world [13:17:17] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] (duration: 11m 56s) [13:17:23] T412947: Reduce cache miss noise in memcached due to hcaptcha health checks - https://phabricator.wikimedia.org/T412947 [13:17:34] Ok, over to you James_F [13:17:40] MichaelG_WMF: Mine are pretty chunky. :-( I'll do yours, then mine, then Eileen's. [13:17:57] James_F: understood, thank you! 🙏 [13:18:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [13:19:13] Hi, I'm sorry I'm late. I had to register my nick with NickServ. [13:19:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1003.eqiad.wmnet [13:19:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [13:19:47] hashar: Can you please not mass-abandon code during a deploy? You've broken the reporting bot due to the backlog.
[13:20:06] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:26:26] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11764615 (10ayounsi) [13:26:38] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:28:43] (03Abandoned) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [13:28:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) (owner: 10Michael Große) [13:28:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [13:29:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1262199 (https://phabricator.wikimedia.org/T421475) (owner: 10Jforrester) [13:30:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1004.eqiad.wmnet [13:30:50] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1264590|instrument(ReviseTone): record start of copyedit session (T419181)]], [[gerrit:1261477|Replace WANObjectCache with new MemcachedWrapper concept (T419666)]], [[gerrit:1262199|Fix match case for setting minute, week or month TTL on 
OrchestratorRequest (T421475)]] [13:31:00] T419181: Update and Restart Revise Tone Experiment - https://phabricator.wikimedia.org/T419181 [13:31:00] T419666: WikiLambda: Replace direct usage of BagOStuff with WANObjectCache - https://phabricator.wikimedia.org/T419666 [13:31:00] T421475: OrchestratorRequest: fails setting ttl with UnhandledMatchError - https://phabricator.wikimedia.org/T421475 [13:31:16] Finally. [13:31:24] !log rebalance Ganeti cluster in ulsfo following the completion of the migration to routed Ganeti T421044 [13:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:29] T421044: ulsfo: balance VMs between all Ganeti nodes - https://phabricator.wikimedia.org/T421044 [13:32:33] !log jforrester@deploy1003 jforrester, migr: Backport for [[gerrit:1264590|instrument(ReviseTone): record start of copyedit session (T419181)]], [[gerrit:1261477|Replace WANObjectCache with new MemcachedWrapper concept (T419666)]], [[gerrit:1262199|Fix match case for setting minute, week or month TTL on OrchestratorRequest (T421475)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can [13:32:33] now be verified there. [13:33:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-jumbo1001.eqiad.wmnet with OS trixie [13:33:33] MichaelG_WMF: Can you check it's OK? Or is it not needed? [13:34:00] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host ganeti-jumbo1001 [13:34:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1004.eqiad.wmnet [13:34:15] James_F: I can quickly check [13:34:20] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:34:20] which server should I use? [13:34:59] Any mwdebug [13:35:09] 👍 [13:35:29] I'm cross-checking mw-debug-eqiad and mw-debug-codfw at my end. 
[13:35:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1005.eqiad.wmnet [13:35:55] James_F: looks all good on my side, no errors in the UI [13:36:09] !log jforrester@deploy1003 jforrester, migr: Continuing with sync [13:36:11] tested with mw-debug-eqiad [13:36:12] Ack. [13:36:52] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765005 (10ayounsi) [13:37:19] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765038 (10ayounsi) [13:37:23] (03Merged) 10jenkins-bot: instrument(ReviseTone): record start of copyedit session [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) (owner: 10Michael Große) [13:37:27] (03Merged) 10jenkins-bot: Replace WANObjectCache with new MemcachedWrapper concept [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [13:37:33] (03Merged) 10jenkins-bot: Fix match case for setting minute, week or month TTL on OrchestratorRequest [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1262199 (https://phabricator.wikimedia.org/T421475) (owner: 10Jforrester) [13:37:49] 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765041 (10JArguello-WMF) [13:38:15] Why thank you wikibugs for telling us about patches that merged 8 minutes ago. 
:-( [13:38:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262303 (https://phabricator.wikimedia.org/T420621) (owner: 10TChin) [13:38:31] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:38:49] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765138 (10ayounsi) [13:39:19] 06SRE, 10MediaModeration, 06Product Safety and Integrity, 07Essential-Work: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11765148 (10Dreamy_Jazz) >>! In T421688#11764035, @Ladsgroup wrote: >>>! In T421688#11763978, @Dreamy_Jazz wrote: >... [13:39:32] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:39:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1005.eqiad.wmnet [13:40:03] bking@cumin2002 reimage (PID 3025222) is awaiting input [13:40:09] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765184 (10ayounsi) [13:40:24] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264590|instrument(ReviseTone): record start of copyedit session (T419181)]], [[gerrit:1261477|Replace WANObjectCache with new MemcachedWrapper concept (T419666)]], [[gerrit:1262199|Fix match case for setting minute, week or month TTL on OrchestratorRequest (T421475)]] (duration: 09m 33s) [13:40:33] T419181: Update and Restart Revise Tone Experiment - https://phabricator.wikimedia.org/T419181 [13:40:33] T419666: WikiLambda: Replace direct usage of BagOStuff with WANObjectCache - https://phabricator.wikimedia.org/T419666 [13:40:33] T421475: OrchestratorRequest: 
fails setting ttl with UnhandledMatchError - https://phabricator.wikimedia.org/T421475 [13:41:44] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765237 (10ayounsi) [13:41:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [13:41:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast7002.wikimedia.org [13:42:01] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1256432|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T419666)]] [13:42:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:42:31] (03Merged) 10jenkins-bot: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [13:43:45] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1256432|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T419666)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:44:17] 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765301 (10ayounsi) [13:44:29] !log jforrester@deploy1003 Sync cancelled. 
[13:45:08] jclark@cumin1003 netbox (PID 3388826) is awaiting input [13:45:09] (03PS1) 10Jforrester: Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264638 (https://phabricator.wikimedia.org/T411807) [13:45:23] (03CR) 10Jforrester: [C:03+2] Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264638 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [13:45:34] jouncebot: nowandnext [13:45:34] For the next 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1300) [13:45:34] In 0 hour(s) and 44 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1430) [13:45:47] OK, over to eileen-m__49's patches. [13:45:51] Amir1: Not now, please. [13:46:02] noted. Mine is not urgent [13:46:11] After eileen-m__49 there's Raine's. [13:46:24] Ack [13:46:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [13:46:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [13:46:31] (03Merged) 10jenkins-bot: Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264638 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [13:47:26] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [13:47:37] Thank you! [13:47:46] Of course. Sorry it's taking so long. 
[13:47:57] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1328.eqiad.wmnet with OS trixie [13:48:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast7002.wikimedia.org [13:48:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org [13:48:25] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1328 [13:48:37] !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [13:48:39] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:49:00] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti-jumbo1001 - bking@cumin2002" [13:49:06] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti-jumbo1001 - bking@cumin2002" [13:49:07] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:49:07] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti-jumbo1001.eqiad.wmnet 140.48.64.10.in-addr.arpa 0.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:49:11] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti-jumbo1001.eqiad.wmnet 140.48.64.10.in-addr.arpa 0.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:49:12] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-jumbo1001 [13:49:35] (03PS1) 10CDanis: haproxy: CIDERGRINDER 🍎 globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1264640 [13:51:36] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-jumbo1001 [13:51:36] !log bking@cumin2002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host ganeti-jumbo1001 [13:51:37] (03Merged) 10jenkins-bot: Instrumentation: Track clicks for user account menu experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [13:52:05] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11765346 (10Krd) I again cannot open https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/... [13:52:50] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [13:53:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11765358 (10Jgreen) a:05Jgreen→03VRiley-WMF @VRiley-WMF I'm not having much luck with this box. Running into two more issues: - iDRAC not handling terminal correctly, arrow... 
[13:54:01] !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [13:54:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org [13:54:15] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis) [13:54:28] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1328 - ayounsi@cumin1003" [13:54:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1328 - ayounsi@cumin1003" [13:54:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:34] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1328.eqiad.wmnet 129.32.64.10.in-addr.arpa 9.2.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:54:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1328.eqiad.wmnet 129.32.64.10.in-addr.arpa 9.2.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:54:39] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1328 [13:54:51] (03CR) 10Ladsgroup: "👊 🇺🇸 🔥" [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis) [13:55:00] Maybe we should ban Minerva patches from backports unless they're the only ones. They're always so slow. :-( [13:55:45] TIL!  I didn't know Minerva patches would be slower to deploy. 
[13:56:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1328 [13:56:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1328 [13:56:24] Yeah, Minerva CI is massive because the Readers group (reasonably) are worried about lots of different things in the interface. [13:56:35] Got it. [13:56:38] So anything that touches that repo runs a huge number of tests. [13:57:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:58:31] Finally. [13:58:44] (03Merged) 10jenkins-bot: Display create account button in main menu when user is logged out. [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland) [13:59:17] !log enabling puppet on A:wikiube-worker-eqiad for T420436 [13:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:23] T420436: Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436 [13:59:45] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1264605|Instrumentation: Track clicks for user account menu experiment (T418053)]], [[gerrit:1264625|Display create account button in main menu when user is logged out. (T418053 T415647)]] [13:59:52] T418053: Add user account button to mobile web header: Instrumentation and experiment setup for first iteration A/B Test - https://phabricator.wikimedia.org/T418053 [13:59:52] T415647: Add "Create account" menu item to mobile web hamburger menu - https://phabricator.wikimedia.org/T415647 [14:00:36] will we have time for my change? or is the upcoming testkitchen window one we really shouldn't step on? [14:00:59] Raine: I don't know, sorry. 
But it's probably fine? Also Amir1 has something too. [14:01:11] 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765409 (10BTullis) I can pick this up and work with you on the details, to make sure tha... [14:01:22] eileen-m__49: For yours, can you check on mw-debug? It'll be there in a minute or so. [14:01:32] yeah... Amir1 is yours a config change? should we bundle them? I am only slightly nervous about mine :D [14:01:49] !log jforrester@deploy1003 emc-wmf, jforrester: Backport for [[gerrit:1264605|Instrumentation: Track clicks for user account menu experiment (T418053)]], [[gerrit:1264625|Display create account button in main menu when user is logged out. (T418053 T415647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:01:59] James_F Sure, I have the extension activated and will check once I see the change. [14:02:11] Excellent. Should be there now. [14:02:20] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo1001.eqiad.wmnet with reason: host reimage [14:02:25] (03CR) 10Vgutierrez: [C:03+1] "👍🍎" [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis) [14:03:19] Raine: don't worry about mine. I can do it later [14:03:22] many meetings [14:03:30] ok, thanks [14:03:33] Is now deploying from within a meeting. [14:03:41] Also, ^ /me etc. [14:04:03] James_F Is it fine if I use k8s-mwdebug for the server? [14:04:09] eileen-m__49: It's required. [14:04:21] Otherwise you will see the not-yet-deployed state. [14:04:27] * Raine is happily not deploying from a bus [14:07:52] eileen-m__49: Are we OK to deploy or should I roll back? 
[14:08:01] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage [14:08:38] @James_F everything looks as expected on the page and I am looking at Flamegraph now... [14:08:41] Ack. [14:08:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo1001.eqiad.wmnet with reason: host reimage [14:09:29] (03CR) 10JavierMonton: [V:03+1] stream: mediawiki.page_edit_type_simple.dev1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261695 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [14:09:52] James_F I don't have a lot of experience with Excimer UI. Is there anything in particular that I should look at to detect red flags? Nothing looks strange, but I could be missing something. [14:10:07] eileen-m__49: I don't use Excimer at all, sorry. [14:10:08] (03CR) 10JavierMonton: [V:03+1] stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [14:11:40] ack [14:11:46] Let's just proceed? [14:12:25] yes [14:12:27] We can proceed [14:12:29] !log jforrester@deploy1003 emc-wmf, jforrester: Continuing with sync [14:12:29] Thanks! [14:12:33] Of course. [14:12:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage [14:13:57] (03CR) 10Jforrester: [C:03+1] "LGTM. Would love to use this soon! :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259222 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [14:16:42] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264605|Instrumentation: Track clicks for user account menu experiment (T418053)]], [[gerrit:1264625|Display create account button in main menu when user is logged out.
(T418053 T415647)]] (duration: 16m 57s) [14:16:49] T418053: Add user account button to mobile web header: Instrumentation and experiment setup for first iteration A/B Test - https://phabricator.wikimedia.org/T418053 [14:16:49] T415647: Add "Create account" menu item to mobile web hamburger menu - https://phabricator.wikimedia.org/T415647 [14:17:51] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es7 [14:18:08] OK, over to Raine. [14:18:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es7 [14:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:19:22] wheee, thanks James_F [14:19:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 429561208 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:19:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [14:19:54] (03CR) 10Jforrester: "Should I deploy this, or should I leave to one of you two? 
Other than confirming the service still runs and doesn't alert I'm not sure I'd" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [14:20:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:20:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2779888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:20:41] (03Merged) 10jenkins-bot: Enable $wgTempCategoryCollations for s3 wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [14:20:56] !log kamila@deploy1003 Started scap sync-world: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] [14:21:06] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [14:21:06] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [14:22:41] !log kamila@deploy1003 kamila: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:23:09] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕥☕ sudo cumin 'A:cp' 'disable-puppet "cdanis CIDER 🍎"' [14:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:59] (03CR) 10Fabfur: [C:03+2] cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:24:07] (03CR) 10CDanis: [C:03+2] haproxy: CIDERGRINDER 🍎 globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis) [14:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:24:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo1001.eqiad.wmnet with OS trixie [14:25:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:26:44] !log kamila@deploy1003 kamila: Continuing with sync [14:27:28] (03PS4) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 [14:27:28] (03CR) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:27:59] (03CR) 10Jforrester: wikifunctions: Slim down staging resources, and fix main 
staging config (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:29:03] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11765564 (10herron) [14:29:20] 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765567 (10BTullis) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1430) [14:30:42] (03CR) 10Herron: [C:03+1] sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [14:30:54] 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11765579 (10Alberto) Thank you for the heads-up regarding the version. I am aware that 1.39 is now EOL. My main priority right now is recovering the connection to Commons to stabilize the s... [14:30:55] !log kamila@deploy1003 Finished scap sync-world: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] (duration: 09m 59s) [14:31:03] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [14:31:03] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [14:31:09] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to version 3.2 on cp6001 and cp6009 [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:31:10] (03CR) 10Herron: [C:03+1] prometheus/pop: consolidate the firewall provider declaration at the role level. 
[puppet] - 10https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [14:31:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1328.eqiad.wmnet with OS trixie [14:31:56] \o/ perfect timing :D [14:32:05] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1329.eqiad.wmnet with OS trixie [14:32:32] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1329 [14:32:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [14:32:52] (03PS5) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [14:32:57] !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:33:21] (03CR) 10Herron: "Yes this is what I7d69fde5d9f2055d42c7b404828eebcff521f025 is for and that approach also will need to be applied to the related dashboards" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [14:34:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:36:02] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765599 (10hnowlan) [14:36:30] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1329 - ayounsi@cumin1003" [14:36:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1329 - ayounsi@cumin1003" [14:36:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:36:36] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1329.eqiad.wmnet 132.32.64.10.in-addr.arpa 2.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:36:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1329.eqiad.wmnet 132.32.64.10.in-addr.arpa 2.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:36:40] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1329 [14:37:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1329 [14:37:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1329 [14:37:26] (03CR) 10Herron: "max() may alert slightly faster, since avg() would split differences between scrapes. 
I don't have a strong preference, but it'll be good" [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [14:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:39:24] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [14:40:01] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765643 (10BTullis) [14:40:12] !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [14:40:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:40:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 288163840 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:40:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:42:10] 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11765660 (10Reedy) 1.42 is not supported either ;) Note there's other instantcommons related changes that REL1_39 will be missing too. 
[14:42:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 57376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:44:30] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:46:09] 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11765700 (10Alberto) Understood, thank you for the correction! I see that moving to a current LTS like 1.44 is the way to go to ensure full compatibility with InstantCommons and security. I... [14:49:08] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage [14:49:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:49:35] (03PS1) 10Kamila Součková: Revert "Enable $wgTempCategoryCollations for s3 wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264651 [14:49:59] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [14:50:35] !log CIDERGRINDER 🍎 now deployed globally 🚀🌍 [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765744 (10LDlulisa-WMF) [14:50:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264651 (owner: 10Kamila Součková) [14:51:09] !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [14:51:59] (03Merged) 10jenkins-bot: Revert "Enable $wgTempCategoryCollations for s3 wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264651 (owner: 10Kamila Součková) [14:52:15] !log kamila@deploy1003 Started scap sync-world: Backport for [[gerrit:1264651|Revert "Enable $wgTempCategoryCollations for s3 wikis."]] [14:53:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:54:00] !log kamila@deploy1003 kamila: Backport for [[gerrit:1264651|Revert "Enable $wgTempCategoryCollations for s3 wikis."]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:54:14] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage [14:54:44] !log kamila@deploy1003 kamila: Continuing with sync [14:55:15] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11765780 (10Blake) Moving this to the backlog for now. [14:56:12] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11765800 (10MoritzMuehlenhoff) [14:57:49] (03PS2) 10NMW03: Add delete-redirect to filemovers on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264652 (https://phabricator.wikimedia.org/T421373) [14:58:02] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T421517#11765822 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:58:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:58:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 295726216 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:58:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264652 (https://phabricator.wikimedia.org/T421373) (owner: 10NMW03) [14:58:58] !log kamila@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264651|Revert "Enable $wgTempCategoryCollations for s3 
wikis."]] (duration: 06m 42s) [14:59:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:59:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2888536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:00:23] (03PS1) 10Clare Ming: Add TestKitchenExposureResetEpoch config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264653 (https://phabricator.wikimedia.org/T414738) [15:01:42] !log depooling cp6001 and cp6009 to upgrade haproxy to v 3.2 (T421402) [15:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:47] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [15:02:48] (03CR) 10Ottomata: [C:03+2] eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) (owner: 10Ottomata) [15:02:57] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp6001.* [15:03:05] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp6009.* [15:03:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:03:58] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on cp6001 and cp6009 [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [15:04:09] (03CR) 10Santiago Faci: [C:03+1] Add TestKitchenExposureResetEpoch config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264653 (https://phabricator.wikimedia.org/T414738) (owner: 10Clare 
Ming) [15:04:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [15:05:16] (03Merged) 10jenkins-bot: eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) (owner: 10Ottomata) [15:05:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264653 (https://phabricator.wikimedia.org/T414738) (owner: 10Clare Ming) [15:06:11] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [15:06:32] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [15:06:41] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [15:07:54] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [15:07:57] (03PS2) 10Santiago Faci: Test Kitchen SLOs: Renaming slos because of the Test Kitchen renaming [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) [15:08:03] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [15:08:28] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11765887 (10hnowlan) Thanks for handling this Ben! I'll remove the SRE tag to clear this from clinic duty for now, but please re-add it if you ne... 
[15:08:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [15:09:31] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [15:09:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1329.eqiad.wmnet with OS trixie [15:10:00] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11765905 (10Jclark-ctr) ` jclark@backup1012:~$ sudo dmidecode -s chassis-serial-number C826SFM12A50003 jclark@backup1012:~$ sudo dmidecode -s baseboard-serial-... [15:11:26] !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@6f6a192] (releasing): Grant Overall/Administer to Arnaudb [15:12:15] !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@6f6a192] (releasing): Grant Overall/Administer to Arnaudb (duration: 01m 01s) [15:12:28] (03CR) 10Santiago Faci: Test Kitchen SLOs: Renaming slos because of the Test Kitchen renaming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci) [15:14:07] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765935 (10E.Enabulele) [15:17:32] (03PS1) 10Fabfur: haproxy: temporary removing haproxy3.2 specific conf [puppet] - 10https://gerrit.wikimedia.org/r/1264657 (https://phabricator.wikimedia.org/T421402) [15:17:39] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Traffic: Decommission codfw cp hosts cp2027-cp2040 - https://phabricator.wikimedia.org/T419753#11765947 (10Jhancock.wm) 05In progress→03Resolved a:03Jhancock.wm [15:19:08] (03Abandoned) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:19:29] (03CR) 10Eevans: [C:03+2] cassandra_dev: upgrade to Cassanra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1262310 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [15:20:08] (03Abandoned) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261526 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:22:14] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11765994 (10Jclark-ctr) @Papaul I’m stuck on these. I’m assuming Supermicro swapped the motherboard or chassis before shipping to eqiad and didn’t update the s... [15:23:45] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11766000 (10BTullis) [15:24:36] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1264659 [15:24:36] 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11766006 (10Aklapper) > As a result, our server’s IP address appears to have been blocked or heavily throttled. @Alberto: Hi, what makes you think so? Please provide exact error messages a... [15:26:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [15:27:05] (03CR) 10Elukey: "Hi Santiago! 
The SLO working group is going to announce later on that Pyrra is being replaced by Sloth, a tool completely integrated in Gr" [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci) [15:28:14] (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264662 (https://phabricator.wikimedia.org/T421366) [15:29:03] (03CR) 10Federico Ceratto: "Updated code and functional tests to use api_client." [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [15:29:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264662 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:31:20] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1264659 (owner: 10Elukey) [15:31:47] !log upgrade cassandra-dev2001 to Cassandra 4.1.11 — T418417 [15:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:52] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [15:31:53] (03CR) 10Fabfur: [C:03+2] haproxy: temporary removing haproxy3.2 specific conf [puppet] - 10https://gerrit.wikimedia.org/r/1264657 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [15:32:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [15:32:34] (03PS1) 10Muehlenhoff: Record LDAP access for atsuko [puppet] - 10https://gerrit.wikimedia.org/r/1264665 [15:32:42] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - 
https://phabricator.wikimedia.org/T420623#11766070 (10RobH) This host was purchased 2024-08-07, so it is still under warranty. If Papaul doesn't know how to use the SUM (I've never used it) then the s... [15:33:03] (03PS3) 10D3r1ck01: Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) [15:34:22] (03PS1) 10Elukey: Upstream release v12.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1264666 [15:34:32] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for atsuko [puppet] - 10https://gerrit.wikimedia.org/r/1264665 (owner: 10Muehlenhoff) [15:34:40] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1264666 (owner: 10Elukey) [15:34:57] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11766099 (10Jclark-ctr) This has already been verified already it is correct. 
the serial number label on the outside of the host shows 'S480845X4915849' [15:36:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:36:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [15:37:16] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp6001*} and A:cp - 3.2 test upgrade () [15:38:18] (03PS1) 10Muehlenhoff: Record LDAP access for eenabulele [puppet] - 10https://gerrit.wikimedia.org/r/1264667 [15:38:49] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1330.eqiad.wmnet with OS trixie [15:39:17] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1330 [15:40:18] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:42:10] !log uploaded spicerack_12.3.0 to apt.wikimedia.org bookworm-wikimedia [15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:24] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp6001*} and A:cp - 3.2 test upgrade () [15:42:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org [15:42:37] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp6009*} and A:cp - 3.2 test upgrade () [15:43:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:43:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T419635)', diff saved to https://phabricator.wikimedia.org/P89969 and previous config saved to /var/cache/conftool/dbconfig/20260330-154352-fceratto.json [15:43:58] T419635: Drop il_to column from imagelinks table 
in wmf production - https://phabricator.wikimedia.org/T419635 [15:44:06] !log upgrade spicerack on cumin2002 [15:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:56] ayounsi@cumin1003 reimage (PID 3589787) is awaiting input [15:46:28] FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:47:59] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp6009*} and A:cp - 3.2 test upgrade () [15:48:18] 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11766183 (10hnowlan) Hi Alberto, thanks for getting in touch about this. At present we have no blocks specific to Urbipedia or your specific IP address. However, it appears that you might b... 
[15:49:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:49:14] (03PS1) 10Elukey: profile::pki::intermediates: refresh discovery's public key [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) [15:51:08] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for eenabulele [puppet] - 10https://gerrit.wikimedia.org/r/1264667 (owner: 10Muehlenhoff) [15:51:27] !log repooling cp6001 and cp6009 (T421402) [15:51:34] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1330 - ayounsi@cumin1003" [15:51:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1330 - ayounsi@cumin1003" [15:51:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:51:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1330.eqiad.wmnet 163.48.64.10.in-addr.arpa 3.6.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:51:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1330.eqiad.wmnet 163.48.64.10.in-addr.arpa 3.6.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:51:44] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1330 [15:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:56] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp6009.* [15:51:56] T421402: Upgrade 
HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [15:52:00] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp6001.* [15:52:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1330 [15:52:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1330 [15:52:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T419635)', diff saved to https://phabricator.wikimedia.org/P89970 and previous config saved to /var/cache/conftool/dbconfig/20260330-155242-fceratto.json [15:52:49] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:54:38] (03PS1) 10Kamila Součková: Enable $wgTempCategoryCollations for s3 wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) [15:55:42] (03CR) 10Kamila Součková: "Attempt #2 after revert due to T421732, hopefully this time with the correct number of `[]`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [15:59:06] !log rearmed keyholder on netmon* hosts following reboots [15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:25] (03PS1) 10JMeybohm: machinetranslation: Remove networkpolicies for people* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) [16:01:28] RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:01:38] (03CR) 10Santiago Faci: "Cool! Thanks for letting us know!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci) [16:01:46] (03CR) 10Scott French: [C:03+1] "Thanks, Raine - This looks good. As an additional check, maybe it makes sense to `mw-debug-repl` into one of the in-scope wikis while in t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [16:02:47] (03CR) 10Kamila Součková: "That seems like an excellent idea :D Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [16:02:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P89971 and previous config saved to /var/cache/conftool/dbconfig/20260330-160251-fceratto.json [16:03:59] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage [16:09:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage [16:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:55] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11766321 (10Ladsgroup) Page previews is still requesting non-standard sizes still. For example, go to https://en.wikipedia.org/wiki/M... 
[16:11:34] (CR) Elukey: "This needs to be coupled with ./modules/secret/secrets/pki/intermediates/discovery-key.pem in puppet private, I am now trying to figure ou" [puppet] - https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [16:11:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS trixie [16:13:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P89972 and previous config saved to /var/cache/conftool/dbconfig/20260330-161259-fceratto.json [16:22:06] (CR) Elukey: "Ahh wait ok:" [puppet] - https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [16:23:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T419635)', diff saved to https://phabricator.wikimedia.org/P89973 and previous config saved to /var/cache/conftool/dbconfig/20260330-162307-fceratto.json [16:23:12] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:23:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:23:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T419635)', diff saved to https://phabricator.wikimedia.org/P89974 and previous config saved to /var/cache/conftool/dbconfig/20260330-162331-fceratto.json [16:24:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1330.eqiad.wmnet with OS trixie [16:32:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T419635)', diff saved to https://phabricator.wikimedia.org/P89975 and previous config saved to /var/cache/conftool/dbconfig/20260330-163239-fceratto.json [16:32:45] T419635: Drop il_to column from imagelinks table in
wmf production - https://phabricator.wikimedia.org/T419635 [16:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:37:43] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS trixie [16:38:03] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS trixie [16:39:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:42:09] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11766443 (Jgreen) All four are switched to UEFI and built. [16:42:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P89976 and previous config saved to /var/cache/conftool/dbconfig/20260330-164248-fceratto.json [16:44:13] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:45:09] SRE-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11766449 (BTullis) @LDlulisa-WMF , @RThomas-WMF , @E.Enabulele - I think that the nex...
[16:46:55] (PS6) Jasmine: service::catalog: add sophroid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748)