[00:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:54] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:16:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:54] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:11:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1264140
[01:11:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1264140 (owner: TrainBranchBot)
[01:21:07] <wikibugs>	 (CR) Zabe: [C:+2] "retry" [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1264118 (owner: TrainBranchBot)
[01:25:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1264140 (owner: TrainBranchBot)
[01:32:25] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1264118 (owner: TrainBranchBot)
[02:00:55] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:01:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:07:45] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 50s)
[02:09:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:16:38] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[02:34:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:11:33] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:55:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:01:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:02:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:37:06] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2026-03-25-072715-production [deployment-charts] - https://gerrit.wikimedia.org/r/1264221
[05:01:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:02:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:03:02] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 68810840 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:04:02] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3484120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:14:31] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1022.eqiad.wmnet with reason: Downgrade clouddb1022 to 10.11.13
[05:16:09] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1022.eqiad.wmnet with reason: Downgrade clouddb1022 to 10.11.13
[05:16:17] <jinxer-wm>	 RESOLVED: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:16:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[05:53:52] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1264257
[05:56:40] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: Kgraessle)
[06:03:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:06:11] <wikibugs>	 (PS1) Tiziano Fogli: prometheus4002: clean up unused Hiera variables [puppet] - https://gerrit.wikimedia.org/r/1264265 (https://phabricator.wikimedia.org/T419430)
[06:06:13] <wikibugs>	 (PS1) Tiziano Fogli: Switch prometheus3004 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264266 (https://phabricator.wikimedia.org/T419960)
[06:06:13] <wikibugs>	 (PS1) Tiziano Fogli: Switch prometheus5002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264267 (https://phabricator.wikimedia.org/T419960)
[06:06:14] <wikibugs>	 (PS1) Tiziano Fogli: Switch prometheus6002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264268 (https://phabricator.wikimedia.org/T419960)
[06:06:15] <wikibugs>	 (PS1) Tiziano Fogli: Switch prometheus7002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264269 (https://phabricator.wikimedia.org/T419960)
[06:07:56] <wikibugs>	 (PS2) Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909)
[06:08:37] <wikibugs>	 (PS3) Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909)
[06:09:02] <wikibugs>	 (CR) Arnaudb: "sounds good to me, updated!" [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[06:09:25] <wikibugs>	 (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[06:13:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[06:15:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:44:33] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1264268 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[06:45:07] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1264267 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[06:45:39] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1264266 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[06:46:06] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1264269 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[06:47:33] <wikibugs>	 (CR) Fabfur: [C:+2] aptrepo: updates configuration for haproxy32 [puppet] - https://gerrit.wikimedia.org/r/1262146 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[06:47:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1264265 (https://phabricator.wikimedia.org/T419430) (owner: Tiziano Fogli)
[06:53:28] <wikibugs>	 (PS1) Muehlenhoff: Fix Cumin alias for ganeti-ulsfo [puppet] - https://gerrit.wikimedia.org/r/1264296 (https://phabricator.wikimedia.org/T418993)
[06:55:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:57:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:58:11] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus4002: clean up unused Hiera variables [puppet] - https://gerrit.wikimedia.org/r/1264265 (https://phabricator.wikimedia.org/T419430) (owner: Tiziano Fogli)
[06:59:32] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] Switch prometheus3004 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264266 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:03:41] <A_smart_kitten>	 hashar: if I got a last-minute patch up for T421458, would you be willing to review/deploy & run a throttle-resetting maintenance script? (pinging you as you mentioned previously in -operations to ping you if there's nobody around for deployments at this time of day, i hope that's okay :) )
[07:03:43] <stashbot>	 T421458: Lift IP cap on 2026-03-30 for Students Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T421458
[07:04:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:04:33] <tappof>	 !log prometheus3004: switch to nftables and reboot (T419960)
[07:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie
[07:05:35] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762861 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4006.wikimedia.org w...
[07:08:06] <tappof>	 !log prometheus4003: reboot (T419960)
[07:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:42] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] Switch prometheus5002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264267 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[07:11:00] <tappof>	 !log prometheus5002: switch to nftables and reboot (T419960)
[07:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:23] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] Switch prometheus6002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264268 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[07:18:10] <tappof>	 !log prometheus6002: switch to nftables and reboot (T419960)
[07:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:44] <wikibugs>	 (Abandoned) Arnaudb: Revert "gerrit: align ATS/Envoy/Apache timeouts" [puppet] - https://gerrit.wikimedia.org/r/1261961 (owner: Arnaudb)
[07:23:45] <anzx>	 quit
[07:24:37] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] Switch prometheus7002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1264269 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[07:24:56] <tappof>	 !log prometheus7002: switch to nftables and reboot (T419960)
[07:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:45] <wikibugs>	 (CR) JavierMonton: [V:+1] eventstreams-internal - increase kafka max message size [deployment-charts] - https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) (owner: Ottomata)
[07:36:01] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1260135 (owner: Andrew Bogott)
[07:38:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie
[07:38:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[07:38:28] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762891 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast4006.wikimedia.org with...
[07:39:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:39:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:39:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1261377 (https://phabricator.wikimedia.org/T421341) (owner: JavierMonton)
[07:39:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[07:41:17] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:41:39] <wikibugs>	 (Merged) jenkins-bot: stream: mediawiki.page_html_content_change [mediawiki-config] - https://gerrit.wikimedia.org/r/1261377 (https://phabricator.wikimedia.org/T421341) (owner: JavierMonton)
[07:42:31] <logmsgbot>	 !log javiermonton@deploy1003 Started scap sync-world: Backport for [[gerrit:1261377|stream: mediawiki.page_html_content_change (T421341)]]
[07:42:39] <stashbot>	 T421341: Update HTML pipeline schema - rendering_content_change - https://phabricator.wikimedia.org/T421341
[07:44:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[07:45:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie
[07:45:33] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie
[07:45:45] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4006.wikimedia.org w...
[07:45:47] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762901 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast4006.wikimedia.org with...
[07:46:17] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:48:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie
[07:48:32] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host bast4006.wikimedia.org with OS trixie
[07:48:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie
[07:49:07] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: JavierMonton)
[07:49:12] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11762903 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast4006.wikimedia.org w...
[07:49:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:23] <wikibugs>	 (CR) JMeybohm: [C:+2] k8s.print-network-topology: Prevent SAL logging [cookbooks] - https://gerrit.wikimedia.org/r/1259082 (owner: JMeybohm)
[07:50:35] <wikibugs>	 (CR) Ayounsi: [C:+1] Nokia: BGP policy for unicast bgp sw_external outside peerings [homer/public] - https://gerrit.wikimedia.org/r/1262197 (https://phabricator.wikimedia.org/T408892) (owner: Cathal Mooney)
[07:50:53] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[07:51:31] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: JavierMonton)
[07:51:32] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:54:04] <wikibugs>	 (Merged) jenkins-bot: k8s.print-network-topology: Prevent SAL logging [cookbooks] - https://gerrit.wikimedia.org/r/1259082 (owner: JMeybohm)
[07:54:42] <godog>	 !log deploy rabbitmq changes to allow cli communication - T420923
[07:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:47] <stashbot>	 T420923: rabbitmqctl list_queues in eqiad/codfw times out after 60s - https://phabricator.wikimedia.org/T420923
[07:54:58] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] rabbitmq: enable cli tools peers communication [puppet] - https://gerrit.wikimedia.org/r/1261366 (https://phabricator.wikimedia.org/T420923) (owner: Filippo Giunchedi)
[07:55:19] <wikibugs>	 (CR) Ayounsi: [C:+1] "I didn't know things were broken. +1 to manually fix it and +1 to that change." [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: Cathal Mooney)
[07:55:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:53] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[07:56:32] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:00:23] <logmsgbot>	 !log javiermonton@deploy1003 javiermonton: Backport for [[gerrit:1261377|stream: mediawiki.page_html_content_change (T421341)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:00:30] <stashbot>	 T421341: Update HTML pipeline schema - rendering_content_change - https://phabricator.wikimedia.org/T421341
[08:00:53] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[08:02:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:02:18] <wikibugs>	 (PS1) Filippo Giunchedi: rabbitmq: fix firewall port range for cli tools [puppet] - https://gerrit.wikimedia.org/r/1264347 (https://phabricator.wikimedia.org/T420923)
[08:03:09] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] rabbitmq: fix firewall port range for cli tools [puppet] - https://gerrit.wikimedia.org/r/1264347 (https://phabricator.wikimedia.org/T420923) (owner: Filippo Giunchedi)
[08:03:22] <wikibugs>	 (CR) Ayounsi: [C:+1] Fix Cumin alias for ganeti-ulsfo [puppet] - https://gerrit.wikimedia.org/r/1264296 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:03:49] <logmsgbot>	 !log javiermonton@deploy1003 javiermonton: Continuing with sync
[08:14:05] <wikibugs>	 (PS1) Filippo Giunchedi: rabbitmq: make server erlang distribution listen on ipv6 [puppet] - https://gerrit.wikimedia.org/r/1264355 (https://phabricator.wikimedia.org/T420923)
[08:14:33] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] rabbitmq: make server erlang distribution listen on ipv6 [puppet] - https://gerrit.wikimedia.org/r/1264355 (https://phabricator.wikimedia.org/T420923) (owner: Filippo Giunchedi)
[08:14:48] <wikibugs>	 (CR) JMeybohm: [C:+2] trafficserver: 100% of /feed/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1260690 (https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[08:15:04] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix Cumin alias for ganeti-ulsfo [puppet] - https://gerrit.wikimedia.org/r/1264296 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:17:42] <logmsgbot>	 !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1261377|stream: mediawiki.page_html_content_change (T421341)]] (duration: 35m 10s)
[08:17:50] <stashbot>	 T421341: Update HTML pipeline schema - rendering_content_change - https://phabricator.wikimedia.org/T421341
[08:18:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:23:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:25:17] <wikibugs>	 (PS1) Filippo Giunchedi: rabbitmq: use correct erlang distribution ports on firewall [puppet] - https://gerrit.wikimedia.org/r/1264360 (https://phabricator.wikimedia.org/T420923)
[08:25:37] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] rabbitmq: use correct erlang distribution ports on firewall [puppet] - https://gerrit.wikimedia.org/r/1264360 (https://phabricator.wikimedia.org/T420923) (owner: Filippo Giunchedi)
[08:26:29] <wikibugs>	 (PS3) Tiziano Fogli: prometheus/pop: consolidate the firewall provider declaration at the role level. [puppet] - https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960)
[08:29:26] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960) (owner: Tiziano Fogli)
[08:32:01] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update doc links in ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/1264365 (https://phabricator.wikimedia.org/T406369)
[08:34:18] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: update doc links in ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/1264365 (https://phabricator.wikimedia.org/T406369) (owner: Ilias Sarantopoulos)
[08:34:19] <wikibugs>	 SRE, ServiceOps new, Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11762999 (MLechvien-WMF) Hi,  The correlation with DC Switchover does not seem obvious, for example it seems there was another...
[08:34:37] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie
[08:35:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:35:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:37:18] <tappof>	 !log prometheus[12]005: reboot (T419960)
[08:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:56] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: update doc links in ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/1264365 (https://phabricator.wikimedia.org/T406369) (owner: Ilias Sarantopoulos)
[08:38:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:38:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:38:13] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:38:19] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:39:24] <icinga-wm>	 PROBLEM - Host prometheus2005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:46] <wikibugs>	 (PS1) Ayounsi: Management routers: move standard security_zones to roles.yaml [homer/public] - https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674)
[08:40:44] <icinga-wm>	 PROBLEM - Host prometheus1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:58] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:41:03] <wikibugs>	 (PS1) MVernon: swift: drain 3 codfw nodes for reimage [puppet] - https://gerrit.wikimedia.org/r/1264368 (https://phabricator.wikimedia.org/T354872)
[08:43:02] <icinga-wm>	 RECOVERY - Host prometheus2005 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[08:43:44] <icinga-wm>	 RECOVERY - Host prometheus1005 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[08:45:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:45:53] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[08:45:58] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:46:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:47:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:47:20] <wikibugs>	 (CR) Elukey: profile::base::certificates: rename Puppet Internal CA's path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1262055 (owner: Elukey)
[08:49:44] <wikibugs>	 ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11763091 (ayounsi) p:Triage→Low a:ayounsi Ultimately Juniper, I'll take the task for now.
[08:50:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.411s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:50:53] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[08:50:58] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:51:25] <tappof>	 !log prometheus[12]007: reboot (T419960)
[08:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:52:37] <XioNoX>	 !log push pfw policy - T421556
[08:52:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:18] <icinga-wm>	 PROBLEM - Host prometheus1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:53:18] <icinga-wm>	 PROBLEM - Host prometheus2007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:55:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.411s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:55:54] <icinga-wm>	 RECOVERY - Host prometheus1007 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[08:55:54] <icinga-wm>	 RECOVERY - Host prometheus2007 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms
[08:56:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie
[09:00:30] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:00:53] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[09:00:55] <wikibugs>	 (CR) Elukey: [C:+2] java: add java-21-security erb template (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: Elukey)
[09:02:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:03:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:05:53] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[09:07:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:10:51] <tappof>	 !log prometheus[12]006: reboot (T419960)
[09:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:46] <tappof>	 !log prometheus[12]008: reboot (T419960)
[09:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:02] <icinga-wm>	 PROBLEM - SSH on prometheus2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:18] <icinga-wm>	 PROBLEM - Host prometheus1006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:13:24] <icinga-wm>	 PROBLEM - Host prometheus1008 is DOWN: PING CRITICAL - Packet loss = 100%
[09:14:02] <icinga-wm>	 PROBLEM - SSH on prometheus2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:14:58] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 12200
[09:15:44] <icinga-wm>	 RECOVERY - Host prometheus1006 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[09:15:51] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12200
[09:15:54] <icinga-wm>	 RECOVERY - Host prometheus1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[09:15:58] <icinga-wm>	 PROBLEM - Host prometheus2008 is DOWN: PING CRITICAL - Packet loss = 100%
[09:15:58] <icinga-wm>	 PROBLEM - Host prometheus2006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:46] <icinga-wm>	 RECOVERY - Host prometheus2008 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[09:16:46] <icinga-wm>	 RECOVERY - Host prometheus2006 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms
[09:16:52] <icinga-wm>	 RECOVERY - SSH on prometheus2008 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:16:52] <icinga-wm>	 RECOVERY - SSH on prometheus2006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:17:15] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 42
[09:17:33] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Nice, thats a lot cleaner, and good reason to standardise interface usage which I’d not really thought was too important :)" [homer/public] - https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) (owner: Ayounsi)
[09:18:28] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:38] <wikibugs>	 (CR) Ayounsi: [C:+2] Management routers: move standard security_zones to roles.yaml [homer/public] - https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) (owner: Ayounsi)
[09:18:43] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:45] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2026-03-27 - 2026-04-17): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11763162 (BTullis) Open→Resolved a:BTullis I'm resolving this ticket, since it is historical.  This is one of the cases where we wish...
[09:19:59] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42
[09:20:49] <wikibugs>	 (Merged) jenkins-bot: Management routers: move standard security_zones to roles.yaml [homer/public] - https://gerrit.wikimedia.org/r/1264367 (https://phabricator.wikimedia.org/T421674) (owner: Ayounsi)
[09:22:21] <wikibugs>	 (CR) Volans: [C:+1] "Cookbook wise LGTM, I'll leave it to your team for the thanos-specific bits :)" [cookbooks] - https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[09:23:28] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:26:19] <wikibugs>	 (PS2) Filippo Giunchedi: openstack: enable rabbit transient quorum queues [puppet] - https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054)
[09:26:19] <wikibugs>	 (PS1) Filippo Giunchedi: openstack: enable trove-guestagent rabbit transient quorum queues [puppet] - https://gerrit.wikimedia.org/r/1264557 (https://phabricator.wikimedia.org/T421054)
[09:29:54] <wikibugs>	 (CR) Filippo Giunchedi: "> Indeed testing in codfw first SGTM!" [puppet] - https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) (owner: Filippo Giunchedi)
[09:42:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:42:27] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast4006.wikimedia.org with OS trixie
[09:45:28] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] openstack: enable rabbit transient quorum queues [puppet] - https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) (owner: Filippo Giunchedi)
[09:46:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS bookworm
[09:47:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:55:46] <wikibugs>	 SRE, SRE-Unowned, WMF-Legal, SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437#11763322 (Bugreporter) Open→Declined Close as declined since WMF plans to shut down all Wikinewses (https://meta.wiki...
[09:55:46] <wikibugs>	 (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/1260716 (owner: Majavah)
[09:55:53] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[09:57:12] <wikibugs>	 (PS2) Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825)
[09:57:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:00:03] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8361/console" [puppet] - https://gerrit.wikimedia.org/r/1260716 (owner: Majavah)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1000)
[10:00:30] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:02:12] <wikibugs>	 (CR) Vgutierrez: [C:-1] "`modules/haproxy/manifests/init.pp` needs to be updated to install `haproxy-awslc` instead of `haproxy` package" [puppet] - https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[10:04:17] <wikibugs>	 (CR) MVernon: [C:+1] "Thanks :)" [puppet] - https://gerrit.wikimedia.org/r/1260716 (owner: Majavah)
[10:05:05] <wikibugs>	 SRE, SRE-Unowned, SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#11763372 (Bugreporter) Open→Declined Close as declined since WMF plans to shut down all Wikinewses (https://meta.wikimedia.org/w/index.php?title=Wikimedia_Foundation_Board_notic...
[10:05:50] <wikibugs>	 (CR) Ladsgroup: [C:+1] swift: drain 3 codfw nodes for reimage [puppet] - https://gerrit.wikimedia.org/r/1264368 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[10:06:20] <wikibugs>	 (CR) MVernon: [C:+2] swift: drain 3 codfw nodes for reimage [puppet] - https://gerrit.wikimedia.org/r/1264368 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[10:08:51] <wikibugs>	 SRE: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988#11763403 (Bugreporter) Stalled→Declined Close as declined since WMF plans to shut down all Wikinewses (https://meta.wikimedia.org/w/index.php?title=Wikimedia_Foundation_Board...
[10:08:56] <wikibugs>	 (CR) JMeybohm: [C:+2] trafficserver: 50% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) (owner: Clément Goubert)
[10:09:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4006.wikimedia.org with reason: host reimage
[10:13:16] <wikibugs>	 (CR) Btullis: [C:+2] Add an LDAP group to the list considered during offboarding [puppet] - https://gerrit.wikimedia.org/r/1261481 (https://phabricator.wikimedia.org/T417213) (owner: Btullis)
[10:14:56] <wikibugs>	 (PS6) Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264)
[10:14:59] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11763431 (MatthewVernon)
[10:15:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4006.wikimedia.org with reason: host reimage
[10:18:41] <wikibugs>	 (PS19) Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873)
[10:19:05] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: Vgutierrez)
[10:21:36] <wikibugs>	 (CR) Ladsgroup: [C:+1] Start reading from new file table in dewiki and fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[10:23:26] <wikibugs>	 (PS1) 1F616EMO: zhwikinews: 20th anniversary logo change [mediawiki-config] - https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165)
[10:28:39] <wikibugs>	 (PS1) 1F616EMO: Revert "zhwikinews: 20th anniversary logo change" [mediawiki-config] - https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165)
[10:28:53] <wikibugs>	 (CR) Btullis: wdqs-queryhammer: Deployment fixes (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[10:30:04] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: Btullis)
[10:33:49] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: 1F616EMO)
[10:36:11] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763501 (mszwarc)
[10:37:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast4006.wikimedia.org with OS bookworm
[10:38:02] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763503 (mszwarc)
[10:38:25] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763504 (mszwarc)
[10:40:35] <wikibugs>	 (CR) Btullis: [C:+2] Route dse-k8s API blackbox checks to team-data-platform [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: Btullis)
[10:43:33] <wikibugs>	 (PS1) Filippo Giunchedi: openstack: set oslo lock path where missing [puppet] - https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054)
[10:44:16] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Add APCu cache layer to health checker [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204)
[10:44:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) (owner: Kosta Harlan)
[10:45:02] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] openstack: set oslo lock path where missing [puppet] - https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054) (owner: Filippo Giunchedi)
[10:45:15] <wikibugs>	 (PS2) Filippo Giunchedi: openstack: set oslo lock path where missing [puppet] - https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054)
[10:45:24] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] openstack: set oslo lock path where missing [puppet] - https://gerrit.wikimedia.org/r/1264577 (https://phabricator.wikimedia.org/T421054) (owner: Filippo Giunchedi)
[10:45:58] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: Kamila Součková)
[10:48:24] <wikibugs>	 (CR) Btullis: [C:+2] Apply the new VAP to several namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: Btullis)
[10:53:43] <wikibugs>	 (PS6) Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146)
[10:53:51] <wikibugs>	 (CR) JMeybohm: [C:+2] trafficserver: 100% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) (owner: Clément Goubert)
[10:54:14] <wikibugs>	 SRE, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11763577 (BTullis)
[10:56:27] <wikibugs>	 (Merged) jenkins-bot: Apply the new VAP to several namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: Btullis)
[11:04:28] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:05:05] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:05:35] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[11:05:38] <wikibugs>	 (PS1) Ladsgroup: Switch from InterwikiSortingPrepend to the ULS config [mediawiki-config] - https://gerrit.wikimedia.org/r/1264581
[11:06:01] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[11:07:22] <wikibugs>	 (PS2) Majavah: cephadm::target: Convert port to an integer [puppet] - https://gerrit.wikimedia.org/r/1260716
[11:07:56] <wikibugs>	 (CR) Majavah: [C:+2] cephadm::target: Convert port to an integer [puppet] - https://gerrit.wikimedia.org/r/1260716 (owner: Majavah)
[11:11:25] <wikibugs>	 (CR) Ladsgroup: [C:+1] "shall we do it?" [puppet] - https://gerrit.wikimedia.org/r/1242430 (owner: Muehlenhoff)
[11:12:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:15:20] <wikibugs>	 (PS20) Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873)
[11:15:57] <wikibugs>	 (CR) Vgutierrez: "produced metrics with PS20 python script: https://phabricator.wikimedia.org/P89966" [puppet] - https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: Vgutierrez)
[11:15:59] <wikibugs>	 (PS2) Majavah: cloudnfs: Remove Huggle project config [puppet] - https://gerrit.wikimedia.org/r/1259079
[11:16:48] <wikibugs>	 (CR) Majavah: [C:+2] cloudnfs: Remove Huggle project config [puppet] - https://gerrit.wikimedia.org/r/1259079 (owner: Majavah)
[11:17:10] <wikibugs>	 (CR) Kamila Součková: Enable $wgTempCategoryCollations for s3 wikis. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: Kamila Součková)
[11:22:06] <wikibugs>	 (CR) Hashar: "I have replied on the other change, we can't just use `profile::docker::engine` there are a bunch of other profiles that are needed :-]" [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: Hashar)
[11:25:26] <wikibugs>	 (PS4) Hashar: ci: use docker.io package starting with Bookworm [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109)
[11:26:58] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1264583
[11:37:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:39:37] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: Hashar)
[11:43:13] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: Vgutierrez)
[11:48:02] <wikibugs>	 (CR) Dpogorzelski: ml-serve: add modified kserve 0.17 chart (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski)
[11:48:34] <wikibugs>	 (CR) Joal: [C:+1] [EventStreamConfig] Add product_metrics.web_base.active_reader_baseline stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1262303 (https://phabricator.wikimedia.org/T420621) (owner: TChin)
[11:49:14] <wikibugs>	 (CR) Dpogorzelski: ml-serve: add modified kserve 0.17 chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski)
[11:51:13] <wikibugs>	 (PS1) Btullis: Update dummy keytabs to match the active list in puppet [labs/private] - https://gerrit.wikimedia.org/r/1264589 (https://phabricator.wikimedia.org/T421241)
[11:51:26] <godog>	 !log bounce neutron-l3-agent on cloudnet1005 - T421054
[11:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:32] <stashbot>	 T421054: Move all openstack rabbitmq queues to quorum - https://phabricator.wikimedia.org/T421054
[11:52:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[11:53:28] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] Update dummy keytabs to match the active list in puppet [labs/private] - https://gerrit.wikimedia.org/r/1264589 (https://phabricator.wikimedia.org/T421241) (owner: Btullis)
[11:54:13] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add policy 'transport-in' to apply as import on transport circuits [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: Cathal Mooney)
[11:54:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
[11:55:34] <wikibugs>	 (Merged) jenkins-bot: Add policy 'transport-in' to apply as import on transport circuits [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: Cathal Mooney)
[11:55:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:34] <wikibugs>	 (PS1) Michael Große: instrument(ReviseTone): record start of copyedit session [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181)
[11:56:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) (owner: Michael Große)
[11:57:02] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763897 (Ladsgroup) I think you're requesting the 330px standard size. Can you switch to 500px instead? That is the size that is bei...
[11:58:37] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763914 (Ladsgroup) note that thumbnails don't get replicated across swift clusters, so changes to runtime after the switchover is a...
[12:01:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
[12:01:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
[12:03:52] <topranks>	 !log apply transport-in policy to core router transport peerings to prefer local anycast routes
[12:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:37] <wikibugs>	 SRE, SRE-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11763945 (BTullis) a:BTullis
[12:06:22] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11763978 (Dreamy_Jazz) >>! In T421688#11763897, @Ladsgroup wrote: > I think you're requesting the 330px standard size. Can you switch...
[12:07:28] <wikibugs>	 (PS15) Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722)
[12:07:36] <wikibugs>	 (CR) Dpogorzelski: ml-serve: add modified kserve 0.17 chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski)
[12:09:11] <wikibugs>	 (CR) CI reject: [V:-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski)
[12:14:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) (owner: Jforrester)
[12:14:16] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11764012 (MLechvien-WMF) @Blake did you use that in recent switchover? We didn't account for capacity in Q4 s...
[12:14:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1262199 (https://phabricator.wikimedia.org/T421475) (owner: Jforrester)
[12:15:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) (owner: Jforrester)
[12:17:26] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] "My knowledge is limited here:)" [puppet] - https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436) (owner: JMeybohm)
[12:17:32] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11764020 (Ladsgroup) We had another one right now: https://lists.wikimedia.org...
[12:18:05] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11764022 (Blake) @MLechvien-WMF This was not completed in time for the switchover. I'm in the middle of a sig...
[12:21:22] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity, Essential-Work: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11764031 (Dreamy_Jazz)
[12:22:15] <wikibugs>	 SRE, MediaModeration, Product Safety and Integrity, Essential-Work: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11764035 (Ladsgroup) >>! In T421688#11763978, @Dreamy_Jazz wrote: >>>! In T421688#11763897, @Ladsgroup wrote: >>...
[12:23:02] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11764036 (ABran-WMF) The new training flow keeps the existing VRTS export unchanged: `vrts.TicketExport2Mbox.pl` still produ...
[12:26:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421643#11764053 (Jclark-ctr) Open→Resolved a:Jclark-ctr Moved lined to L1/L2 off L3  Sensor: Line, AA:L3, Current Value: 12.02 A (current) Thresholds: Hi...
[12:26:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11764069 (Jclark-ctr) a:Jclark-ctr
[12:27:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421527#11764071 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[12:29:37] <wikibugs>	 (03PS3) 10EMcFarland: Instrumentation: Track clicks for user account menu experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053)
[12:30:02] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez)
[12:30:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Instrumentation: Track clicks for user account menu experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[12:31:10] <wikibugs>	 (03PS3) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825)
[12:31:35] <wikibugs>	 (03PS9) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485)
[12:31:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[12:33:32] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[12:34:57] <wikibugs>	 (03CR) 10Michael Große: "recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[12:34:57] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] aptrepo,haproxy: add haproxy-awslc component/package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[12:34:58] <moritzm>	 !log failover Ganeti master in ulsfo to ganeti4008
[12:35:01] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:26] <wikibugs>	 (03PS1) 10EMcFarland: Display create account button in main menu when user is logged out. [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053)
[12:41:08] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[12:43:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11764146 (10Jclark-ctr) Reseated the power supply, but the error returned. I will open a Supermicro ticket and provide an update.
[12:44:00] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11764147 (10Ajuanca) What's the simplest cookbook I can run to check the changes? I have tried with `sre.maps.roll-restart-reboot` but I get missing `/etc/cumin/config.yaml`
[12:51:01] <wikibugs>	 (03PS4) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825)
[12:51:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Register airflow-fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/1264628 (https://phabricator.wikimedia.org/T421703)
[12:52:28] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ipip_exporter: Fix timer interval [puppet] - 10https://gerrit.wikimedia.org/r/1264629 (https://phabricator.wikimedia.org/T419873)
[12:52:34] <wikibugs>	 (03CR) 10Cparle: [C:03+1] Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup)
[12:52:46] <wikibugs>	 (03CR) 10Elukey: "I like it, I added a few comments just to be sure!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1262196 (https://phabricator.wikimedia.org/T393053) (owner: 10JHathaway)
[12:52:54] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704 (10ayounsi) 03NEW
[12:53:46] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1264629 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez)
[12:56:06] <wikibugs>	 (03PS1) 10Anne Tomasevich: Add event stream for logged-in reader retention experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264630 (https://phabricator.wikimedia.org/T420490)
[12:58:16] <wikibugs>	 (03CR) 10Elukey: "I have an ignorant question - does it mean that we'll get the same burrow metrics with the same values but with different "instance" label" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron)
[12:58:20] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] prometheus::ipip_exporter: Fix timer interval [puppet] - 10https://gerrit.wikimedia.org/r/1264629 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez)
[12:58:24] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Switch our servers to use deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1256371 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff)
[12:58:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11764305 (10Jclark-ctr) supermicro case #00107974
[12:59:22] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11764339 (10ayounsi)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1300).
[13:00:05] <jouncebot>	 kostajh, Raine, MichaelG_WMF, James_F, and eileen-m__: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:06] <wikibugs>	 (03CR) 10Elukey: "looks good, what is the difference with using max()?" [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron)
[13:00:15] * MichaelG_WMF is here 👋
[13:00:55] <Raine>	 o/, I am omw back from a dr appointment, so I'd like to go last (and I can self deploy) 
[13:01:14] <kostajh>	 hi
[13:01:41] <kostajh>	 Mind if I start?
[13:02:24] <MichaelG_WMF>	 kostajh please go ahead :)
[13:02:49] <kostajh>	 ok
[13:03:13] <MichaelG_WMF>	 I will also need someone to deploy my change (I don't think there is a way to test it, it only adds server side instrumentation)
[13:03:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan)
[13:05:02] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Add APCu cache layer to health checker [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264578 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan)
[13:05:12] <jayme>	 !log disabling puppet on A:wikiube-worker-eqiad for T420436
[13:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:18] <stashbot>	 T420436: Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436
[13:05:21] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]]
[13:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:27] <stashbot>	 T412947: Reduce cache miss noise in memcached due to hcaptcha health checks - https://phabricator.wikimedia.org/T412947
[13:05:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet
[13:06:50] <wikibugs>	 (03CR) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:06:51] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:06:53] <James_F>	 I'm ready whenever
[13:07:12] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:07:27] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] wikikube: Switch to IPIP mode on workers [puppet] - 10https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[13:08:19] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11764400 (10ayounsi)
[13:09:44] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[13:10:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
[13:12:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet
[13:12:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet
[13:14:38] <wikibugs>	 (03PS1) 10Kgraessle: Set live configuration for Extension:PersonalDashboard on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415)
[13:15:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1003.eqiad.wmnet
[13:15:40] <kostajh>	 My change is syncing out now. James_F are you able to help with syncing MichaelG_WMF's patch?
[13:16:33] <James_F>	 Sure.
[13:16:52] <MichaelG_WMF>	 And ideally also eileen-m__ patches later (they're still struggling with NickServ, but I can test them as well)
[13:17:17] <MichaelG_WMF>	 but if those get pushed to next time then that is also not the end of the world
[13:17:17] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] (duration: 11m 56s)
[13:17:23] <stashbot>	 T412947: Reduce cache miss noise in memcached due to hcaptcha health checks - https://phabricator.wikimedia.org/T412947
[13:17:34] <kostajh>	 Ok, over to you James_F
[13:17:40] <James_F>	 MichaelG_WMF: Mine are pretty chunky. :-( I'll do yours, then mine, then Eileen's.
[13:17:57] <MichaelG_WMF>	 James_F: understood, thank you! 🙏
[13:18:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
[13:19:13] <eileen-m__93>	 Hi, I'm sorry I'm late.  I had to register my nick with nickserv.
[13:19:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1003.eqiad.wmnet
[13:19:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
[13:19:47] <James_F>	 hashar: Can you please not mass-abandon code during a deploy? You've broken the reporting bot due to the backlog.
[13:20:06] <wikibugs>	 (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski)
[13:26:26] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11764615 (10ayounsi)
[13:26:38] <wikibugs>	 (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski)
[13:28:43] <wikibugs>	 (03Abandoned) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez)
[13:28:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) (owner: 10Michael Große)
[13:28:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester)
[13:29:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1262199 (https://phabricator.wikimedia.org/T421475) (owner: 10Jforrester)
[13:30:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1004.eqiad.wmnet
[13:30:50] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1264590|instrument(ReviseTone): record start of copyedit session (T419181)]], [[gerrit:1261477|Replace WANObjectCache with new MemcachedWrapper concept (T419666)]], [[gerrit:1262199|Fix match case for setting minute, week or month TTL on OrchestratorRequest (T421475)]]
[13:31:00] <stashbot>	 T419181: Update and Restart Revise Tone Experiment - https://phabricator.wikimedia.org/T419181
[13:31:00] <stashbot>	 T419666: WikiLambda: Replace direct usage of BagOStuff with WANObjectCache - https://phabricator.wikimedia.org/T419666
[13:31:00] <stashbot>	 T421475: OrchestratorRequest: fails setting ttl with UnhandledMatchError - https://phabricator.wikimedia.org/T421475
[13:31:16] <James_F>	 Finally.
[13:31:24] <moritzm>	 !log rebalance Ganeti cluster in ulsfo following the completion of the migration to routed Ganeti T421044
[13:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:29] <stashbot>	 T421044: ulsfo: balance VMs between all Ganeti nodes - https://phabricator.wikimedia.org/T421044
[13:32:33] <logmsgbot>	 !log jforrester@deploy1003 jforrester, migr: Backport for [[gerrit:1264590|instrument(ReviseTone): record start of copyedit session (T419181)]], [[gerrit:1261477|Replace WANObjectCache with new MemcachedWrapper concept (T419666)]], [[gerrit:1262199|Fix match case for setting minute, week or month TTL on OrchestratorRequest (T421475)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:33:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-jumbo1001.eqiad.wmnet with OS trixie
[13:33:33] <James_F>	 MichaelG_WMF: Can you check it's OK? Or is it not needed?
[13:34:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host ganeti-jumbo1001
[13:34:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1004.eqiad.wmnet
[13:34:15] <MichaelG_WMF>	 James_F: I can quickly check
[13:34:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:34:20] <MichaelG_WMF>	 which server should I use?
[13:34:59] <James_F>	 Any mwdebug
[13:35:09] <MichaelG_WMF>	 👍
[13:35:29] <James_F>	 I'm cross-checking mw-debug-eqiad and mw-debug-codfw at my end.
[13:35:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1005.eqiad.wmnet
[13:35:55] <MichaelG_WMF>	 James_F: looks all good on my side, no errors in the UI
[13:36:09] <logmsgbot>	 !log jforrester@deploy1003 jforrester, migr: Continuing with sync
[13:36:11] <MichaelG_WMF>	 tested with mw-debug-eqiad
[13:36:12] <James_F>	 Ack.
[13:36:52] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765005 (10ayounsi)
[13:37:19] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765038 (10ayounsi)
[13:37:23] <wikibugs>	 (03Merged) 10jenkins-bot: instrument(ReviseTone): record start of copyedit session [extensions/GrowthExperiments] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264590 (https://phabricator.wikimedia.org/T419181) (owner: 10Michael Große)
[13:37:27] <wikibugs>	 (03Merged) 10jenkins-bot: Replace WANObjectCache with new MemcachedWrapper concept [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester)
[13:37:33] <wikibugs>	 (03Merged) 10jenkins-bot: Fix match case for setting minute, week or month TTL on OrchestratorRequest [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1262199 (https://phabricator.wikimedia.org/T421475) (owner: 10Jforrester)
[13:37:49] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765041 (10JArguello-WMF)
[13:38:15] <James_F>	 Why thank you wikibugs for telling us about patches that merged 8 minutes ago. :-(
[13:38:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262303 (https://phabricator.wikimedia.org/T420621) (owner: 10TChin)
[13:38:31] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:38:49] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765138 (10ayounsi)
[13:39:19] <wikibugs>	 06SRE, 10MediaModeration, 06Product Safety and Integrity, 07Essential-Work: MediaModeration: Increased thumbnail transform time since DC switchover - https://phabricator.wikimedia.org/T421688#11765148 (10Dreamy_Jazz) >>! In T421688#11764035, @Ladsgroup wrote: >>>! In T421688#11763978, @Dreamy_Jazz wrote: >...
[13:39:32] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:39:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1005.eqiad.wmnet
[13:40:03] <logmsgbot>	 bking@cumin2002 reimage (PID 3025222) is awaiting input
[13:40:09] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765184 (10ayounsi)
[13:40:24] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264590|instrument(ReviseTone): record start of copyedit session (T419181)]], [[gerrit:1261477|Replace WANObjectCache with new MemcachedWrapper concept (T419666)]], [[gerrit:1262199|Fix match case for setting minute, week or month TTL on OrchestratorRequest (T421475)]] (duration: 09m 33s)
[13:40:33] <stashbot>	 T419181: Update and Restart Revise Tone Experiment - https://phabricator.wikimedia.org/T419181
[13:40:33] <stashbot>	 T419666: WikiLambda: Replace direct usage of BagOStuff with WANObjectCache - https://phabricator.wikimedia.org/T419666
[13:40:33] <stashbot>	 T421475: OrchestratorRequest: fails setting ttl with UnhandledMatchError - https://phabricator.wikimedia.org/T421475
[13:41:44] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765237 (10ayounsi)
[13:41:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester)
[13:41:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast7002.wikimedia.org
[13:42:01] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1256432|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T419666)]]
[13:42:11] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:42:31] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester)
[13:43:45] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1256432|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T419666)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:44:17] <wikibugs>	 06SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11765301 (10ayounsi)
[13:44:29] <logmsgbot>	 !log jforrester@deploy1003 Sync cancelled.
[13:45:08] <logmsgbot>	 jclark@cumin1003 netbox (PID 3388826) is awaiting input
[13:45:09] <wikibugs>	 (03PS1) 10Jforrester: Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264638 (https://phabricator.wikimedia.org/T411807)
[13:45:23] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264638 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester)
[13:45:34] <Amir1>	 jouncebot: nowandnext
[13:45:34] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1300)
[13:45:34] <jouncebot>	 In 0 hour(s) and 44 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1430)
[13:45:47] <James_F>	 OK, over to eileen-m__49's patches.
[13:45:51] <James_F>	 Amir1: Not now, please.
[13:46:02] <Amir1>	 noted. Mine is not urgent
[13:46:11] <James_F>	 After eileen-m__49 there's Raine's.
[13:46:24] <James_F>	 Ack
[13:46:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[13:46:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[13:46:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264638 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester)
[13:47:26] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip
[13:47:37] <eileen-m__49>	 Thank you!
[13:47:46] <James_F>	 Of course. Sorry it's taking so long.
[13:47:57] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1328.eqiad.wmnet with OS trixie
[13:48:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast7002.wikimedia.org
[13:48:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org
[13:48:25] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1328
[13:48:37] <logmsgbot>	 !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99)
[13:48:39] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[13:49:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti-jumbo1001 - bking@cumin2002"
[13:49:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti-jumbo1001 - bking@cumin2002"
[13:49:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:49:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti-jumbo1001.eqiad.wmnet 140.48.64.10.in-addr.arpa 0.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:49:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti-jumbo1001.eqiad.wmnet 140.48.64.10.in-addr.arpa 0.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:49:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-jumbo1001
[13:49:35] <wikibugs>	 (03PS1) 10CDanis: haproxy: CIDERGRINDER 🍎 globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1264640
[13:51:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-jumbo1001
[13:51:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ganeti-jumbo1001
[13:51:37] <wikibugs>	 (03Merged) 10jenkins-bot: Instrumentation: Track clicks for user account menu experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264605 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[13:52:05] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11765346 (10Krd) I again cannot open https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/...
[13:52:50] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip
[13:53:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11765358 (10Jgreen) a:05Jgreen→03VRiley-WMF @VRiley-WMF I'm not having much luck with this box. Running into two more issues:  - iDRAC not handling terminal correctly, arrow...
[13:54:01] <logmsgbot>	 !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99)
[13:54:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org
[13:54:15] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis)
[13:54:28] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1328 - ayounsi@cumin1003"
[13:54:34] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1328 - ayounsi@cumin1003"
[13:54:34] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:54:34] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1328.eqiad.wmnet 129.32.64.10.in-addr.arpa 9.2.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:54:38] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1328.eqiad.wmnet 129.32.64.10.in-addr.arpa 9.2.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:54:39] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1328
[13:54:51] <wikibugs>	 (03CR) 10Ladsgroup: "👊 🇺🇸 🔥" [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis)
[13:55:00] <James_F>	 Maybe we should ban Minerva patches from backports unless they're the only ones. They're always so slow. :-(
[13:55:45] <eileen-m__49>	 TIL!  I didn't know Minerva patches would be slower to deploy.
[13:56:04] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1328
[13:56:04] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1328
[13:56:24] <James_F>	 Yeah, Minerva CI is massive because the Readers group (reasonably) are worried about lots of different things in the interface.
[13:56:35] <eileen-m__49>	 Got it.
[13:56:38] <James_F>	 So anything that touches that repo runs a huge number of tests.
[13:57:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:58:31] <James_F>	 Finally.
[13:58:44] <wikibugs>	 (03Merged) 10jenkins-bot: Display create account button in main menu when user is logged out. [skins/MinervaNeue] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1264625 (https://phabricator.wikimedia.org/T418053) (owner: 10EMcFarland)
[13:59:17] <jayme>	 !log enabling puppet on A:wikikube-worker-eqiad for T420436
[13:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:23] <stashbot>	 T420436: Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436
[13:59:45] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1264605|Instrumentation: Track clicks for user account menu experiment (T418053)]], [[gerrit:1264625|Display create account button in main menu when user is logged out. (T418053 T415647)]]
[13:59:52] <stashbot>	 T418053: Add user account button to mobile web header: Instrumentation and experiment setup for first iteration A/B Test - https://phabricator.wikimedia.org/T418053
[13:59:52] <stashbot>	 T415647: Add "Create account" menu item to mobile web hamburger menu - https://phabricator.wikimedia.org/T415647
[14:00:36] <Raine>	 will we have time for my change? or is the upcoming testkitchen window one we really shouldn't step on?
[14:00:59] <James_F>	 Raine: I don't know, sorry. But it's probably fine? Also Amir1 has something too.
[14:01:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765409 (10BTullis) I can pick this up and work with you on the details, to make sure tha...
[14:01:22] <James_F>	 eileen-m__49: For yours, can you check on mw-debug? It'll be there in a minute or so.
[14:01:32] <Raine>	 yeah... Amir1 is yours a config change? should we bundle them? I am only slightly nervous about mine :D 
[14:01:49] <logmsgbot>	 !log jforrester@deploy1003 emc-wmf, jforrester: Backport for [[gerrit:1264605|Instrumentation: Track clicks for user account menu experiment (T418053)]], [[gerrit:1264625|Display create account button in main menu when user is logged out. (T418053 T415647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:01:59] <eileen-m__49>	 James_F Sure, I have the extension activated and will check once I see the change.
[14:02:11] <James_F>	 Excellent. Should be there now.
[14:02:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo1001.eqiad.wmnet with reason: host reimage
[14:02:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "👍🍎" [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis)
[14:03:19] <Amir1>	 Raine: don't worry about mine. I can do it later
[14:03:22] <Amir1>	 many meetings
[14:03:30] <Raine>	 ok, thanks
[14:03:33] <James_F>	 Is now deploying from within a meeting.
[14:03:41] <James_F>	 Also, ^ /me etc.
[14:04:03] <eileen-m__49>	 James_F Is it fine if I use k8s-mwdebug for the server?
[14:04:09] <James_F>	 eileen-m__49: It's required.
[14:04:21] <James_F>	 Otherwise you will see the not-yet-deployed state.
[14:04:27] * Raine is happily not deploying from a bus 
[14:07:52] <James_F>	 eileen-m__49: Are we OK to deploy or should I roll back?
[14:08:01] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage
[14:08:38] <eileen-m__49>	 @James_F everything looks as expected on the page and I am looking at Flamegraph now....
[14:08:41] <James_F>	 Ack.
[14:08:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo1001.eqiad.wmnet with reason: host reimage
[14:09:29] <wikibugs>	 (03CR) 10JavierMonton: [V:03+1] stream: mediawiki.page_edit_type_simple.dev1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261695 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun)
[14:09:52] <eileen-m__49>	 James_F I don't have a lot of experience with Excimer UI.  Is there anything in particular that I should look at to detect red flags?  Nothing looks strange, but I could be missing something.
[14:10:07] <James_F>	 eileen-m__49: I don't use Excimer at all, sorry.
[14:10:08] <wikibugs>	 (03CR) 10JavierMonton: [V:03+1] stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun)
[14:11:40] <eileen-m__49>	 ack
[14:11:46] <James_F>	 Let's just proceed?
[14:12:25] <eileen-m__49>	 yes
[14:12:27] <eileen-m__49>	 We can proceed
[14:12:29] <logmsgbot>	 !log jforrester@deploy1003 emc-wmf, jforrester: Continuing with sync
[14:12:29] <eileen-m__49>	 Thanks!
[14:12:33] <James_F>	 Of course.
[14:12:38] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage
[14:13:57] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "LGTM. Would love to use this soon! :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259222 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus)
[14:16:42] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264605|Instrumentation: Track clicks for user account menu experiment (T418053)]], [[gerrit:1264625|Display create account button in main menu when user is logged out. (T418053 T415647)]] (duration: 16m 57s)
[14:16:49] <stashbot>	 T418053: Add user account button to mobile web header: Instrumentation and experiment setup for first iteration A/B Test - https://phabricator.wikimedia.org/T418053
[14:16:49] <stashbot>	 T415647: Add "Create account" menu item to mobile web hamburger menu - https://phabricator.wikimedia.org/T415647
[14:17:51] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es7
[14:18:08] <James_F>	 OK, over to Raine.
[14:18:20] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es7
[14:19:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:19:22] <Raine>	 wheee, thanks James_F 
[14:19:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 429561208 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:19:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková)
[14:19:54] <wikibugs>	 (03CR) 10Jforrester: "Should I deploy this, or should I leave to one of you two? Other than confirming the service still runs and doesn't alert I'm not sure I'd" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus)
[14:20:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:20:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2779888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:20:41] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgTempCategoryCollations for s3 wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková)
[14:20:56] <logmsgbot>	 !log kamila@deploy1003 Started scap sync-world: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]]
[14:21:06] <stashbot>	 T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274
[14:21:06] <stashbot>	 T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049
[14:22:41] <logmsgbot>	 !log kamila@deploy1003 kamila: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:23:09] <cdanis>	 !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕥☕ sudo cumin 'A:cp' 'disable-puppet "cdanis CIDER 🍎"'
[14:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:59] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[14:24:07] <wikibugs>	 (03CR) 10CDanis: [C:03+2] haproxy: CIDERGRINDER 🍎 globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1264640 (owner: 10CDanis)
[14:24:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:24:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo1001.eqiad.wmnet with OS trixie
[14:25:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:26:44] <logmsgbot>	 !log kamila@deploy1003 kamila: Continuing with sync
[14:27:28] <wikibugs>	 (03PS4) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455
[14:27:28] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester)
[14:27:59] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester)
[14:29:03] <wikibugs>	 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11765564 (10herron)
[14:29:20] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765567 (10BTullis)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260330T1430)
[14:30:42] <wikibugs>	 (03CR) 10Herron: [C:03+1] sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli)
[14:30:54] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11765579 (10Alberto) Thank you for the heads-up regarding the version. I am aware that 1.39 is now EOL. My main priority right now is recovering the connection to Commons to stabilize the s...
[14:30:55] <logmsgbot>	 !log kamila@deploy1003 Finished scap sync-world: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] (duration: 09m 59s)
[14:31:03] <stashbot>	 T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274
[14:31:03] <stashbot>	 T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049
[14:31:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to version 3.2 on cp6001 and cp6009 [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[14:31:10] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus/pop: consolidate the firewall provider declaration at the role level. [puppet] - 10https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli)
[14:31:18] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1328.eqiad.wmnet with OS trixie
[14:31:56] <Raine>	 \o/ perfect timing :D
[14:32:05] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1329.eqiad.wmnet with OS trixie
[14:32:32] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1329
[14:32:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[14:32:52] <wikibugs>	 (03PS5) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey)
[14:32:57] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:33:21] <wikibugs>	 (03CR) 10Herron: "Yes this is what I7d69fde5d9f2055d42c7b404828eebcff521f025 is for and that approach also will need to be applied to the related dashboards" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron)
[14:34:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:36:02] <wikibugs>	 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765599 (10hnowlan)
[14:36:30] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1329 - ayounsi@cumin1003"
[14:36:36] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1329 - ayounsi@cumin1003"
[14:36:36] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:36:36] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1329.eqiad.wmnet 132.32.64.10.in-addr.arpa 2.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:36:40] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1329.eqiad.wmnet 132.32.64.10.in-addr.arpa 2.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:36:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1329
[14:37:00] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1329
[14:37:00] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1329
[14:37:26] <wikibugs>	 (03CR) 10Herron: "max() may alert slightly faster, since avg() would split differences between scrapes.  I don't have a strong preference, but it'll be good" [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron)
[14:39:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:39:24] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip
[14:40:01] <wikibugs>	 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765643 (10BTullis)
[14:40:12] <logmsgbot>	 !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99)
[14:40:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:40:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 288163840 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:40:37] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[14:42:10] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11765660 (10Reedy) 1.42 is not supported either ;)  Note there's other instantcommons related changes that REL1_39 will be missing too.
[14:42:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 57376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:44:30] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:45:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:46:09] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11765700 (10Alberto) Understood, thank you for the correction! I see that moving to a current LTS like 1.44 is the way to go to ensure full compatibility with InstantCommons and security. I...
[14:49:08] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage
[14:49:30] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:49:35] <wikibugs>	 (03PS1) 10Kamila Součková: Revert "Enable $wgTempCategoryCollations for s3 wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264651
[14:49:59] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip
[14:50:35] <cdanis>	 !log CIDERGRINDER 🍎 now deployed globally 🚀🌍
[14:50:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:52] <wikibugs>	 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765744 (10LDlulisa-WMF)
[14:50:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264651 (owner: 10Kamila Součková)
[14:51:09] <logmsgbot>	 !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99)
[14:51:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable $wgTempCategoryCollations for s3 wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264651 (owner: 10Kamila Součková)
[14:52:15] <logmsgbot>	 !log kamila@deploy1003 Started scap sync-world: Backport for [[gerrit:1264651|Revert "Enable $wgTempCategoryCollations for s3 wikis."]]
[14:53:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:54:00] <logmsgbot>	 !log kamila@deploy1003 kamila: Backport for [[gerrit:1264651|Revert "Enable $wgTempCategoryCollations for s3 wikis."]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:54:14] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage
[14:54:44] <logmsgbot>	 !log kamila@deploy1003 kamila: Continuing with sync
[14:55:15] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11765780 (10Blake) Moving this to the backlog for now.
[14:56:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11765800 (10MoritzMuehlenhoff)
[14:57:49] <wikibugs>	 (03PS2) 10NMW03: Add delete-redirect to filemovers on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264652 (https://phabricator.wikimedia.org/T421373)
[14:58:02] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T421517#11765822 (10MoritzMuehlenhoff) p:05Triage→03Medium
[14:58:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:58:32] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 295726216 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:58:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264652 (https://phabricator.wikimedia.org/T421373) (owner: 10NMW03)
[14:58:58] <logmsgbot>	 !log kamila@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264651|Revert "Enable $wgTempCategoryCollations for s3 wikis."]] (duration: 06m 42s)
[14:59:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:59:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2888536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:00:23] <wikibugs>	 (03PS1) 10Clare Ming: Add TestKitchenExposureResetEpoch config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264653 (https://phabricator.wikimedia.org/T414738)
[15:01:42] <fabfur>	 !log depooling cp6001 and cp6009 to upgrade haproxy to v 3.2 (T421402)
[15:01:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:47] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:02:48] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) (owner: 10Ottomata)
[15:02:57] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp6001.*
[15:03:05] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp6009.*
[15:03:30] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:03:58] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on cp6001 and cp6009 [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[15:04:09] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] Add TestKitchenExposureResetEpoch config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264653 (https://phabricator.wikimedia.org/T414738) (owner: 10Clare Ming)
[15:04:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet
[15:05:16] <wikibugs>	 (03Merged) 10jenkins-bot: eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) (owner: 10Ottomata)
[15:05:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264653 (https://phabricator.wikimedia.org/T414738) (owner: 10Clare Ming)
[15:06:11] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[15:06:32] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[15:06:41] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[15:07:54] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[15:07:57] <wikibugs>	 (03PS2) 10Santiago Faci: Test Kitchen SLOs: Renaming slos because of the Test Kitchen renaming [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381)
[15:08:03] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[15:08:28] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11765887 (10hnowlan) Thanks for handling this Ben! I'll remove the SRE tag to clear this from clinic duty for now, but please re-add it if you ne...
[15:08:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet
[15:09:31] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[15:09:58] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1329.eqiad.wmnet with OS trixie
[15:10:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11765905 (10Jclark-ctr) ` jclark@backup1012:~$ sudo dmidecode -s chassis-serial-number C826SFM12A50003 jclark@backup1012:~$ sudo dmidecode -s baseboard-serial-...
[15:11:26] <logmsgbot>	 !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@6f6a192] (releasing): Grant Overall/Administer to Arnaudb
[15:12:15] <logmsgbot>	 !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@6f6a192] (releasing): Grant Overall/Administer to Arnaudb (duration: 01m 01s)
[15:12:28] <wikibugs>	 (03CR) 10Santiago Faci: Test Kitchen SLOs: Renaming slos because of the Test Kitchen renaming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci)
[15:14:07] <wikibugs>	 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11765935 (10E.Enabulele)
[15:17:32] <wikibugs>	 (03PS1) 10Fabfur: haproxy: temporary removing haproxy3.2 specific conf [puppet] - 10https://gerrit.wikimedia.org/r/1264657 (https://phabricator.wikimedia.org/T421402)
[15:17:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Traffic: Decommission codfw cp hosts cp2027-cp2040 - https://phabricator.wikimedia.org/T419753#11765947 (10Jhancock.wm) 05In progress→03Resolved a:03Jhancock.wm
[15:19:08] <wikibugs>	 (03Abandoned) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana)
[15:19:29] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra_dev: upgrade to Cassanra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1262310 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans)
[15:20:08] <wikibugs>	 (03Abandoned) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261526 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana)
[15:22:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11765994 (10Jclark-ctr) @Papaul I’m stuck on these. I’m assuming Supermicro swapped the motherboard or chassis before shipping to eqiad and didn’t update the s...
[15:23:45] <wikibugs>	 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11766000 (10BTullis)
[15:24:36] <wikibugs>	 (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1264659
[15:24:36] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11766006 (10Aklapper) > As a result, our server’s IP address appears to have been blocked or heavily throttled.  @Alberto: Hi, what makes you think so? Please provide exact error messages a...
[15:26:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org
[15:27:05] <wikibugs>	 (03CR) 10Elukey: "Hi Santiago! The SLO working group is going to announce later on that Pyrra is being replaced by Sloth, a tool completely integrated in Gr" [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci)
[15:28:14] <wikibugs>	 (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264662 (https://phabricator.wikimedia.org/T421366)
[15:29:03] <wikibugs>	 (03CR) 10Federico Ceratto: "Updated code and functional tests to use api_client." [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto)
[15:29:54] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264662 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana)
[15:31:20] <wikibugs>	 (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1264659 (owner: 10Elukey)
[15:31:47] <urandom>	 !log upgrade cassandra-dev2001 to Cassandra 4.1.11 — T418417
[15:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:52] <stashbot>	 T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417
[15:31:53] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxy: temporary removing haproxy3.2 specific conf [puppet] - 10https://gerrit.wikimedia.org/r/1264657 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[15:32:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org
[15:32:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for atsuko [puppet] - 10https://gerrit.wikimedia.org/r/1264665
[15:32:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11766070 (10RobH) This host was purchased 2024-08-07, so it is still under warranty.  If Papaul doesn't know how to use the SUM (I've never used it) then the s...
[15:33:03] <wikibugs>	 (03PS3) 10D3r1ck01: Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833)
[15:34:22] <wikibugs>	 (03PS1) 10Elukey: Upstream release v12.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1264666
[15:34:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for atsuko [puppet] - 10https://gerrit.wikimedia.org/r/1264665 (owner: 10Muehlenhoff)
[15:34:40] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1264666 (owner: 10Elukey)
[15:34:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11766099 (10Jclark-ctr) This has already been verified and it is correct. The serial number label on the outside of the host shows 'S480845X4915849'
[15:36:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[15:36:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org
[15:37:16] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp6001*} and A:cp - 3.2 test upgrade ()
[15:38:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for eenabulele [puppet] - 10https://gerrit.wikimedia.org/r/1264667
[15:38:49] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1330.eqiad.wmnet with OS trixie
[15:39:17] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1330
[15:40:18] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[15:42:10] <elukey>	 !log uploaded spicerack_12.3.0 to apt.wikimedia.org bookworm-wikimedia
[15:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:24] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp6001*} and A:cp - 3.2 test upgrade ()
[15:42:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org
[15:42:37] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp6009*} and A:cp - 3.2 test upgrade ()
[15:43:44] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[15:43:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T419635)', diff saved to https://phabricator.wikimedia.org/P89969 and previous config saved to /var/cache/conftool/dbconfig/20260330-154352-fceratto.json
[15:43:58] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:44:06] <elukey>	 !log upgrade spicerack on cumin2002
[15:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:45:56] <logmsgbot>	 ayounsi@cumin1003 reimage (PID 3589787) is awaiting input
[15:46:28] <jinxer-wm>	 FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[15:47:59] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp6009*} and A:cp - 3.2 test upgrade ()
[15:48:18] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11766183 (10hnowlan) Hi Alberto, thanks for getting in touch about this. At present we have no blocks specific to Urbipedia or your specific IP address. However, it appears that you might b...
[15:49:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:49:14] <wikibugs>	 (03PS1) 10Elukey: profile::pki::intermediates: refresh discovery's public key [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993)
[15:51:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for eenabulele [puppet] - 10https://gerrit.wikimedia.org/r/1264667 (owner: 10Muehlenhoff)
[15:51:27] <fabfur>	 !log repooling cp6001 and cp6009 (T421402)
[15:51:34] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1330 - ayounsi@cumin1003"
[15:51:40] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1330 - ayounsi@cumin1003"
[15:51:40] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:51:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1330.eqiad.wmnet 163.48.64.10.in-addr.arpa 3.6.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:51:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1330.eqiad.wmnet 163.48.64.10.in-addr.arpa 3.6.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:51:44] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1330
[15:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:56] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp6009.*
[15:51:56] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:52:00] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp6001.*
[15:52:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1330
[15:52:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1330
[15:52:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T419635)', diff saved to https://phabricator.wikimedia.org/P89970 and previous config saved to /var/cache/conftool/dbconfig/20260330-155242-fceratto.json
[15:52:49] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:54:38] <wikibugs>	 (03PS1) 10Kamila Součková: Enable $wgTempCategoryCollations for s3 wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274)
[15:55:42] <wikibugs>	 (03CR) 10Kamila Součková: "Attempt #2 after revert due to T421732, hopefully this time with the correct number of `[]`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková)
[15:59:06] <moritzm>	 !log rearmed keyholder on netmon* hosts following reboots
[15:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:25] <wikibugs>	 (03PS1) 10JMeybohm: machinetranslation: Remove networkpolicies for people* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491)
[16:01:28] <jinxer-wm>	 RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:01:38] <wikibugs>	 (03CR) 10Santiago Faci: "Cool! Thanks for letting us know!" [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci)
[16:01:46] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Raine - This looks good. As an additional check, maybe it makes sense to `mw-debug-repl` into one of the in-scope wikis while in t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková)
[16:02:47] <wikibugs>	 (03CR) 10Kamila Součková: "That seems like an excellent idea :D Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264670 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková)
[16:02:52] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P89971 and previous config saved to /var/cache/conftool/dbconfig/20260330-160251-fceratto.json
[16:03:59] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage
[16:09:01] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage
[16:09:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:55] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11766321 (10Ladsgroup) Page previews is still requesting non-standard sizes. For example, go to https://en.wikipedia.org/wiki/M...
[16:11:34] <wikibugs>	 (03CR) 10Elukey: "This needs to be coupled with ./modules/secret/secrets/pki/intermediates/discovery-key.pem in puppet private, I am now trying to figure ou" [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[16:11:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS trixie
[16:13:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P89972 and previous config saved to /var/cache/conftool/dbconfig/20260330-161259-fceratto.json
[16:22:06] <wikibugs>	 (03CR) 10Elukey: "Ahh wait ok:" [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[16:23:07] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T419635)', diff saved to https://phabricator.wikimedia.org/P89973 and previous config saved to /var/cache/conftool/dbconfig/20260330-162307-fceratto.json
[16:23:12] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:23:24] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[16:23:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T419635)', diff saved to https://phabricator.wikimedia.org/P89974 and previous config saved to /var/cache/conftool/dbconfig/20260330-162331-fceratto.json
[16:24:58] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1330.eqiad.wmnet with OS trixie
[16:32:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T419635)', diff saved to https://phabricator.wikimedia.org/P89975 and previous config saved to /var/cache/conftool/dbconfig/20260330-163239-fceratto.json
[16:32:45] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:34:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:37:43] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS trixie
[16:38:03] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS trixie
[16:39:13] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:42:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11766443 (10Jgreen) All four are switched to UEFI and built.
[16:42:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P89976 and previous config saved to /var/cache/conftool/dbconfig/20260330-164248-fceratto.json
[16:44:13] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:45:09] <wikibugs>	 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11766449 (10BTullis) @LDlulisa-WMF , @RThomas-WMF , @E.Enabulele - I think that the nex...
[16:46:55] <wikibugs>	 (03PS6) 10Jasmine: service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748)