[00:01:07] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[00:02:45] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp4042 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:45] <icinga-wm>	 PROBLEM - statsv Varnishkafka log producer on cp4038 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:47] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp4043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:47] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp4037 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:48] <icinga-wm>	 PROBLEM - statsv Varnishkafka log producer on cp4039 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:09:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655153 (phaultfinder)
[00:33:32] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10655170 (Papaul)
[00:35:23] <wikibugs>	 (PS1) Gergő Tisza: Enable SUL3 logins for 1% of group 1 users [mediawiki-config] - https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153)
[00:37:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: Gergő Tisza)
[00:38:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1129551
[00:38:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1129551 (owner: TrainBranchBot)
[00:39:46] <wikibugs>	 (CR) CI reject: [V:-1] Enable SUL3 logins for 1% of group 1 users [mediawiki-config] - https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: Gergő Tisza)
[00:49:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:54:36] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1129551 (owner: TrainBranchBot)
[01:08:43] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1129583
[01:08:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1129583 (owner: TrainBranchBot)
[01:10:09] <wikibugs>	 (CR) Dzahn: [C:+1] "Yea, logically it makes sense to me to create the user in mediawiki::system_users, the compiler output looks good and the number of affect" [puppet] - https://gerrit.wikimedia.org/r/1129389 (owner: Ahmon Dancy)
[01:11:39] <wikibugs>	 (CR) Dzahn: "@muehlenhoff this would be after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129389/3" [puppet] - https://gerrit.wikimedia.org/r/1129408 (owner: Ahmon Dancy)
[01:37:41] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1129583 (owner: TrainBranchBot)
[02:54:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:18] <wikibugs>	 (PS1) Chuckonwumelu: Add new profile [labs/private] - https://gerrit.wikimedia.org/r/1129595
[03:32:21] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[04:01:07] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[04:33:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:48:30] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:49:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:53:30] <jinxer-wm>	 FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:53:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:24] <wikibugs>	 (CR) Arnaudb: [C:+1] vrts: add parameters for exim_deny_senders from private repo [puppet] - https://gerrit.wikimedia.org/r/1129369 (https://phabricator.wikimedia.org/T389356) (owner: Dzahn)
[05:41:40] <wikibugs>	 (CR) Arnaudb: [C:+1] vrts: add profile::vrts::exim_deny_senders with fake value [labs/private] - https://gerrit.wikimedia.org/r/1129374 (https://phabricator.wikimedia.org/T389079) (owner: Dzahn)
[05:56:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0600)
[06:05:53] <marostegui>	 jouncebot: next
[06:05:53] <jouncebot>	 In 1 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0800)
[06:06:58] <wikibugs>	 (PS1) Marostegui: db-production.php: Disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129609 (https://phabricator.wikimedia.org/T388627)
[06:08:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129609 (https://phabricator.wikimedia.org/T388627) (owner: Marostegui)
[06:08:30] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[06:08:46] <wikibugs>	 (Merged) jenkins-bot: db-production.php: Disable writes on es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129609 (https://phabricator.wikimedia.org/T388627) (owner: Marostegui)
[06:08:52] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10655783 (Marostegui) Thank you!
[06:09:46] <logmsgbot>	 !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1129609|db-production.php: Disable writes on es7 (T388627)]]
[06:09:50] <stashbot>	 T388627: Disable circular replication after DC switchover - https://phabricator.wikimedia.org/T388627
[06:13:14] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1129609|db-production.php: Disable writes on es7 (T388627)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:13:17] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[06:13:20] <wikibugs>	 (PS1) Marostegui: db1246: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1129610 (https://phabricator.wikimedia.org/T387673)
[06:13:44] <wikibugs>	 (CR) Marostegui: [C:+2] db1246: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1129610 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[06:14:17] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10655795 (Marostegui) I am automatically slowly pooling this host back
[06:14:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74266 and previous config saved to /var/cache/conftool/dbconfig/20250320-061426-root.json
[06:17:04] <wikibugs>	 (PS1) Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129611
[06:17:10] <wikibugs>	 (CR) Marostegui: [C:-2] "Not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129611 (owner: Marostegui)
[06:19:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655807 (phaultfinder)
[06:20:53] <logmsgbot>	 !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129609|db-production.php: Disable writes on es7 (T388627)]] (duration: 11m 07s)
[06:20:57] <stashbot>	 T388627: Disable circular replication after DC switchover - https://phabricator.wikimedia.org/T388627
[06:21:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es7
[06:21:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es7
[06:22:45] <wikibugs>	 (CR) Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129611 (owner: Marostegui)
[06:22:47] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db-production.php: Disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129611 (owner: Marostegui)
[06:23:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es6
[06:23:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es6
[06:23:37] <wikibugs>	 (Merged) jenkins-bot: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129611 (owner: Marostegui)
[06:23:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section x1
[06:23:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section x1
[06:24:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s8
[06:24:40] <logmsgbot>	 !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1129611|Revert "db-production.php: Disable writes on es7"]]
[06:24:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s8
[06:25:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s7
[06:25:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s7
[06:26:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s6
[06:26:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s6
[06:26:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s5
[06:28:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s5
[06:28:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s4
[06:29:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s4
[06:29:30] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1129611|Revert "db-production.php: Disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:29:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P74267 and previous config saved to /var/cache/conftool/dbconfig/20250320-062931-root.json
[06:29:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s3
[06:30:07] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[06:30:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s3
[06:30:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s2
[06:30:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s2
[06:31:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s1
[06:31:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s1
[06:34:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T389367
[06:35:00] <stashbot>	 T389367: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T389367
[06:35:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2161 with weight 0 T389367', diff saved to https://phabricator.wikimedia.org/P74268 and previous config saved to /var/cache/conftool/dbconfig/20250320-063509-marostegui.json
[06:36:39] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1129288 (https://phabricator.wikimedia.org/T389367) (owner: Gerrit maintenance bot)
[06:37:29] <logmsgbot>	 !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129611|Revert "db-production.php: Disable writes on es7"]] (duration: 12m 48s)
[06:39:44] <marostegui>	 !log Starting s8 codfw failover from db2165 to db2161 - T389367
[06:39:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2161 to s8 primary T389367', diff saved to https://phabricator.wikimedia.org/P74269 and previous config saved to /var/cache/conftool/dbconfig/20250320-064012-marostegui.json
[06:40:16] <stashbot>	 T389367: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T389367
[06:41:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2165 T389367', diff saved to https://phabricator.wikimedia.org/P74270 and previous config saved to /var/cache/conftool/dbconfig/20250320-064131-marostegui.json
[06:43:12] <wikibugs>	 (PS1) Marostegui: db2165: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1129725 (https://phabricator.wikimedia.org/T387441)
[06:43:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2165.codfw.wmnet
[06:44:07] <wikibugs>	 (CR) Marostegui: [C:+2] db2165: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1129725 (https://phabricator.wikimedia.org/T387441) (owner: Marostegui)
[06:44:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74271 and previous config saved to /var/cache/conftool/dbconfig/20250320-064437-root.json
[06:50:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2165.codfw.wmnet
[06:51:52] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on db2165.codfw.wmnet with reason: Maintenance
[06:54:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:27] <wikibugs>	 (Abandoned) Ayounsi: Remove v6 include for e8/f8 uplinks [dns] - https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) (owner: Ayounsi)
[06:59:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74272 and previous config saved to /var/cache/conftool/dbconfig/20250320-065942-root.json
[07:02:24] <wikibugs>	 (CR) Btullis: [C:+2] Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: Btullis)
[07:04:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:32] <wikibugs>	 ops-drmrs, Infrastructure-Foundations, netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10655910 (ayounsi) @RobH make sure to link the inbound shipment to the existing ticket, so remote hands can set it up directly.  Let's also use the initial positions : port...
[07:14:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655911 (phaultfinder)
[07:14:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74273 and previous config saved to /var/cache/conftool/dbconfig/20250320-071448-root.json
[07:24:37] <moritzm>	 !log rebalance ganeti eqiad/C following reimages T382507
[07:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:41] <stashbot>	 T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507
[07:25:29] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add function to replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1128779 (https://phabricator.wikimedia.org/T389170)
[07:26:20] <wikibugs>	 (PS2) Filippo Giunchedi: alertmanager: replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1128780 (https://phabricator.wikimedia.org/T389170)
[07:28:47] <wikibugs>	 (PS1) Filippo Giunchedi: kubernetes: replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1129178 (https://phabricator.wikimedia.org/T389170)
[07:28:48] <wikibugs>	 (PS1) Filippo Giunchedi: snmp_exporter: replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170)
[07:29:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74274 and previous config saved to /var/cache/conftool/dbconfig/20250320-072953-root.json
[07:31:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet
[07:32:21] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[07:32:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet
[07:32:37] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: refactor nova.py with cache [puppet] - https://gerrit.wikimedia.org/r/1129370
[07:32:50] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: refactor Filter to work with CloudHost [puppet] - https://gerrit.wikimedia.org/r/1129371
[07:33:18] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti5005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1129262 (owner: Muehlenhoff)
[07:34:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:31] <logmsgbot>	 !log btullis@deploy2002 Started deploy [dumps/dumps@2fe1059]: Fixing index out of range error
[07:35:31] <wikibugs>	 (CR) Gergő Tisza: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: Gergő Tisza)
[07:35:32] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [dumps/dumps@2fe1059]: Fixing index out of range error (duration: 00m 09s)
[07:35:44] <logmsgbot>	 !log btullis@deploy2002 Started deploy [dumps/dumps@2fe1059]: Fixing index out of range error
[07:35:50] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [dumps/dumps@2fe1059]: Fixing index out of range error (duration: 00m 09s)
[07:35:52] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: refactor nova.py with cache [puppet] - https://gerrit.wikimedia.org/r/1129370 (owner: Filippo Giunchedi)
[07:35:59] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: refactor Filter to work with CloudHost [puppet] - https://gerrit.wikimedia.org/r/1129371 (owner: Filippo Giunchedi)
[07:41:15] <wikibugs>	 (PS1) Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854)
[07:41:21] <logmsgbot>	 !log btullis@deploy2002 Started deploy [dumps/dumps@2fe1059]: Fixing index out of range error
[07:41:28] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [dumps/dumps@2fe1059]: Fixing index out of range error (duration: 00m 26s)
[07:41:36] <wikibugs>	 (CR) CI reject: [V:-1] role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:42:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Echo] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: Bartosz Dziewoński)
[07:42:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Echo] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) (owner: Bartosz Dziewoński)
[07:43:05] <wikibugs>	 (PS2) Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854)
[07:44:05] <wikibugs>	 (CR) Elukey: [C:+2] maps: remove Kartotherian from bare metal nodes [puppet] - https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) (owner: Elukey)
[07:45:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74275 and previous config saved to /var/cache/conftool/dbconfig/20250320-074459-root.json
[07:45:02] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5117/" [puppet] - https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:45:34] <wikibugs>	 (PS3) Elukey: role::ml_k8s::worker: move ml-serve2002 to containerd [puppet] - https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854)
[07:46:59] <elukey>	 !log remove kartotherian from maps* bare metal nodes
[07:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:29] <wikibugs>	 (CR) Gergő Tisza: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: Gergő Tisza)
[07:54:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:01] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10655946 (fgiunchedi) >>! In T374711#10652054, @jhathaway wrote: >>>! In T374711#10650455, @fgiunchedi wrote: >> There's two parts to keyholder, `-proxy...
[07:59:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:26] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0800).
[08:00:05] <jouncebot>	 tgr and MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] <MichaelG_WMF>	 o/
[08:00:56] <tgr_>	 my config change is failing with "Script git diff stash@{0} stash@{1} --minimal --color --exit-code handling the diffConfig event returned with error code 1"
[08:01:05] <tgr_>	 I can't make sense of that error
[08:01:07] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[08:02:12] <tgr_>	 the 7.4 diffConfig produces the exact same message but passes
[08:02:13] <MichaelG_WMF>	 since when do we have 8.1 jobs in config?
[08:02:27] <MichaelG_WMF>	 that seems like a CI-error
[08:02:47] <tgr_>	 we are mostly on 8.1 now so it would make sense
[08:03:00] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views
[08:03:09] <MichaelG_WMF>	 AIUI the job in itself has to "fail" (in the logs) to show the diff, but somehow that is overwritten in the final consideration
[08:03:20] <MichaelG_WMF>	 makes sense yes, but since when is it actually live?
[08:03:27] * MichaelG_WMF looks at previous changes
[08:04:00] <MichaelG_WMF>	 mine do not have them: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1129336
[08:04:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:38] <wikibugs>	 (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy)
[08:04:39] <MichaelG_WMF>	 so, probably they were added last night, but a mistake was made and not discovered until now?
[08:04:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655952 (10phaultfinder)
[08:05:01] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] role::ml_k8s::worker: move ml-serve2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[08:05:46] <wikibugs>	 (03CR) 10Michael Große: "recheck" [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:05:59] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet
[08:06:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089)
[08:06:39] <tgr_>	 I guess that would have been https://gerrit.wikimedia.org/r/c/integration/config/+/1129364 ?
[08:07:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[08:07:36] <MichaelG_WMF>	 sounds plausible
[08:08:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) (owner: 10Muehlenhoff)
[08:08:53] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[08:09:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[08:09:01] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Release version 0.1.1 [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 (owner: 10Slyngshede)
[08:09:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129263 (owner: 10Muehlenhoff)
[08:09:25] <jinxer-wm>	 RESOLVED: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:18] <tgr_>	 08:46:18 Script git diff stash@{0} stash@{1} --minimal --color --exit-code handling the diffConfig event returned with error code 1
[08:10:21] <tgr_>	 08:46:18 Build step 'Execute shell' marked build as failure
[08:10:23] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] varnish: X-Requestctl is now being handled by HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez)
[08:10:29] <tgr_>	 (sorry wrong buffer)
[08:10:33] <tgr_>	 https://gerrit.wikimedia.org/r/c/integration/config/+/1129765
[08:10:36] <tgr_>	 I think
[08:10:48] <wikibugs>	 (03Merged) 10jenkins-bot: Release version 0.1.1 [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 (owner: 10Slyngshede)
[08:12:17] <tgr_>	 other config patches seem to be passing, though?
[08:12:32] <tgr_>	 anyway I can deploy the backport in the meantime
[08:13:35] <MichaelG_WMF>	 your change looks good, not sure why CI is failing for it
[08:13:48] <MichaelG_WMF>	 @tgr_ yes, deploying my backports would be great :)
[08:14:06] <wikibugs>	 (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto)
[08:15:20] <MichaelG_WMF>	 my backports are not properly testable. We are adding logging so we can figure out in what circumstances a warning occurs, which means we cannot actively trigger it to test it. The warning may or may not show up in the Echo channel
[08:15:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CA: add timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto)
[08:16:16] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: switchover to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833)
[08:16:19] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: switchover to gerrit2002 [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833)
[08:16:36] <wikibugs>	 (03PS4) 10Arnaudb: gerrit: switchover to gerrit2002 [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833)
[08:16:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti5007 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129766
[08:16:48] <MichaelG_WMF>	 re other config changes: I think CI passes on [CommonSettings: Migrate CentralNotice to Virtual Domains](https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1129229) because it does not actually change anything?
[08:17:11] <MichaelG_WMF>	 and for that job, success and failure are swapped.
[08:18:36] <tgr_>	 we are still using codfw for deploying, right?
[08:18:53] <MichaelG_WMF>	 no idea, I'm sorry
[08:19:22] <tgr_>	 I imagine there would be a motd saying so if we didn't
[08:19:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:19:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/Echo] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:20:11] <moritzm>	 !log installing python-cryptography security updates
[08:20:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:22] <wikibugs>	 (03CR) 10Gergő Tisza: "(sorry just testing T389460)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto)
[08:23:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:23:14] <MichaelG_WMF>	 tgr_: I think the phan job on the -wmf.20 backport died. Not sure why
[08:24:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656027 (10phaultfinder)
[08:26:15] <MichaelG_WMF>	 looks like some issue setting up the workspace? pretty sure that is unrelated to the actual change
[08:26:58] <wikibugs>	 (03PS2) 10Muehlenhoff: Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089)
[08:29:17] <tgr_>	 rsync: [sender] change_dir "/castor-mw-ext-and-skins/wmf-1.44.0-wmf.20/mwext-php74-phan" (in caches) failed: No such file or directory (2)
[08:29:44] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:31:02] <wikibugs>	 (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[08:32:11] <Emperor>	 !log restart swift-proxy on ms-fe2010 T360913
[08:32:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:15] <stashbot>	 T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[08:33:09] <wikibugs>	 (03Merged) 10jenkins-bot: Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:33:11] <wikibugs>	 (03Merged) 10jenkins-bot: Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński)
[08:34:41] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129336|Add logging to help figure unserialization issues (T388725)]], [[gerrit:1129362|Add logging to help figure unserialization issues (T388725)]]
[08:34:45] <stashbot>	 T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725
[08:36:34] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet
[08:37:28] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] network/data.yaml: add sandbox1-b3-magru [puppet] - 10https://gerrit.wikimedia.org/r/1129219 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi)
[08:40:21] <wikibugs>	 (03PS1) 10Majavah: Revert "cloudgw/icmp check/ip6: disabling" [puppet] - 10https://gerrit.wikimedia.org/r/1129770 (https://phabricator.wikimedia.org/T388379)
[08:40:37] <XioNoX>	 !log merge/deploy network/data.yaml: add sandbox1-b3-magru
[08:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:14] <logmsgbot>	 !log tgr@deploy2002 matmarex, tgr: Backport for [[gerrit:1129336|Add logging to help figure unserialization issues (T388725)]], [[gerrit:1129362|Add logging to help figure unserialization issues (T388725)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:41:17] <stashbot>	 T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725
[08:43:08] <XioNoX>	 !log deploy pfw policy - T389456
[08:43:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:11] <logmsgbot>	 !log tgr@deploy2002 matmarex, tgr: Continuing with sync
[08:43:13] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet
[08:44:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656067 (10phaultfinder)
[08:45:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet
[08:46:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Revert "cloudgw/icmp check/ip6: disabling" [puppet] - 10https://gerrit.wikimedia.org/r/1129770 (https://phabricator.wikimedia.org/T388379) (owner: 10Majavah)
[08:47:42] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: move ml-serve2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[08:48:00] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Revert "cloudgw/icmp check/ip6: disabling" [puppet] - 10https://gerrit.wikimedia.org/r/1129770 (https://phabricator.wikimedia.org/T388379) (owner: 10Majavah)
[08:48:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet
[08:48:56] <wikibugs>	 (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto)
[08:49:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656099 (10phaultfinder)
[08:50:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1128225 (owner: 10Filippo Giunchedi)
[08:50:46] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129336|Add logging to help figure unserialization issues (T388725)]], [[gerrit:1129362|Add logging to help figure unserialization issues (T388725)]] (duration: 16m 05s)
[08:50:51] <stashbot>	 T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725
[08:51:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: base: don't show diff for phaste config [puppet] - 10https://gerrit.wikimedia.org/r/1128225
[08:51:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5007 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129766 (owner: 10Muehlenhoff)
[08:51:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] base: don't show diff for phaste config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128225 (owner: 10Filippo Giunchedi)
[08:51:47] <MichaelG_WMF>	 and I'm already seeing the new warnings rolling in. Thank you!
[08:52:04] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Remove HE through SG.IX [homer/public] - 10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) (owner: 10Ayounsi)
[08:52:38] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bookworm
[08:53:03] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2002
[08:53:12] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[08:53:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:54:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:05] <wikibugs>	 (03CR) 10Ayounsi: "Awesome, thanks!" [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi)
[08:55:11] <Emperor>	 !log restart swift-proxy on ms-fe1010 T360913
[08:55:11] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] nginx: Remove prometheus.lua [puppet] - 10https://gerrit.wikimedia.org/r/1036672 (owner: 10Muehlenhoff)
[08:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:15] <stashbot>	 T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[08:57:54] <tgr_>	 jnuche: is it OK if I run over the window by ~20 min?
[08:58:10] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2002 - elukey@cumin1002"
[08:58:16] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2002 - elukey@cumin1002"
[08:58:16] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:58:16] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2002.codfw.wmnet 43.16.192.10.in-addr.arpa 3.4.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:58:19] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2002.codfw.wmnet 43.16.192.10.in-addr.arpa 3.4.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:58:20] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2002
[08:58:20] <tgr_>	 sorry, trying to do too many things at once
[08:58:31] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2002
[08:58:31] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2002
[08:58:39] <tgr_>	 (almost done setting up a test environment for ProofreadPage though)
[08:58:40] <jnuche>	 tgr_: yeah, it's no problem, as you know the train is blocked atm anyway
[08:58:58] <jnuche>	 thanks for working on that btw :)
[08:59:19] <tgr_>	 the fix is easy. Getting to the point where I can test it apparently isn't
[09:00:04] <jouncebot>	 jnuche and jeena: MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0900). Please do the needful.
[09:00:32] <jnuche>	 morning, as just mentioned ^, train blocked on T389430
[09:00:32] <stashbot>	 T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430
[09:00:47] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[09:01:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[09:01:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:01:45] <wikibugs>	 (03CR) 10Elukey: [C:03+1] spicerack: convert some @property into methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans)
[09:02:31] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SUL3 logins for 1% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[09:02:41] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[09:03:00] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129546|Enable SUL3 logins for 1% of group 1 users (T384153)]]
[09:03:03] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[09:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:07:37] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[09:08:39] <tgr_>	 jnuche: so, I don't know anything about ProofreadPage, but locally the siteinfo API tells me Page and Index are content namespaces with the patch, and editing those namespaces works
[09:09:02] <tgr_>	 (I didn't try to recreate the cross-extension conflict locally, but pretty sure this is the right way to fix it)
[09:09:11] <tgr_>	 do we need to find a reviewer for that patch?
[09:09:55] <tgr_>	 ("that patch" being https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/1129771 )
[09:10:21] <tgr_>	 (we don't, Tpt was faster)
[09:10:50] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-2] "ats-tls is no longer in place in the CDN, HAProxy takes care of this" [puppet] - 10https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794) (owner: 10BryanDavis)
[09:11:31] <tgr_>	 one sec, broke a bunch of tests
[09:12:16] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1129546|Enable SUL3 logins for 1% of group 1 users (T384153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:12:20] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[09:12:35] <jnuche>	 tgr_: ack, do you think we should ping Lucas Werkmeister about taking a look at the patch? he seemed to have more context about the whole thing
[09:14:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656174 (10phaultfinder)
[09:14:42] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,haproxy: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794)
[09:15:21] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-2] "please see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129774" [puppet] - 10https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794) (owner: 10BryanDavis)
[09:15:32] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794) (owner: 10Vgutierrez)
[09:15:35] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) (owner: 10Muehlenhoff)
[09:15:44] <wikibugs>	 (03PS11) 10Ayounsi: netbox: refactor support for GraphQL queries [software/homer] - 10https://gerrit.wikimedia.org/r/1094291
[09:16:14] <wikibugs>	 (03CR) 10Ayounsi: netbox: refactor support for GraphQL queries (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi)
[09:16:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10656186 (10elukey) The reimage seems to fail after provisioning with UEFI, the partitioning step fails. This is the error that I see in /var/log/syslog:  ` Mar 20...
[09:18:37] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on db2165.codfw.wmnet with reason: Maintenance
[09:18:52] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1129775
[09:19:11] <wikibugs>	 (03PS1) 10Elukey: installserver: fix preseed config for puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1129776 (https://phabricator.wikimedia.org/T381274)
[09:19:18] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm
[09:19:19] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1129775 (owner: 10Marostegui)
[09:21:21] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[09:23:32] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps to 30% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589)
[09:28:47] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129546|Enable SUL3 logins for 1% of group 1 users (T384153)]] (duration: 25m 47s)
[09:28:51] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[09:29:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656221 (10phaultfinder)
[09:29:39] <tgr_>	 fixed the tests
[09:29:48] <tgr_>	 let's see if Tpt is still watching
[09:30:20] <Amir1>	 can I quickly deploy a config patch in between?
[09:30:49] <tgr_>	 I'm done with mine
[09:31:08] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 30% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[09:31:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[09:31:47] <Amir1>	 thanks. I'll be done in a couple of minutes
[09:31:59] <wikibugs>	 (03Merged) 10jenkins-bot: Bump thumbnail steps to 30% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[09:32:26] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1129778|Bump thumbnail steps to 30% (T360589)]]
[09:32:30] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[09:32:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[09:34:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[09:35:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129261 (owner: 10Muehlenhoff)
[09:35:29] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1129778|Bump thumbnail steps to 30% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:37:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:37:40] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Remove HE through SG.IX [homer/public] - 10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) (owner: 10Ayounsi)
[09:38:16] <wikibugs>	 (03Merged) 10jenkins-bot: Remove HE through SG.IX [homer/public] - 10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) (owner: 10Ayounsi)
[09:38:59] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[09:39:21] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] logstash: read k8s-mw topics as needed [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi)
[09:39:39] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:39:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656234 (10phaultfinder)
[09:40:14] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] hieradata: move prometheus k8s instances off prometheus2006 [puppet] - 10https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[09:42:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1129776 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey)
[09:44:53] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[09:46:22] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129778|Bump thumbnail steps to 30% (T360589)]] (duration: 13m 55s)
[09:46:26] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[09:50:09] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2002.codfw.wmnet with OS bookworm
[09:50:40] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bookworm
[09:50:58] <wikibugs>	 (03CR) 10Elukey: [C:03+2] installserver: fix preseed config for puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1129776 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey)
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:03] <wikibugs>	 (03PS16) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227)
[09:52:06] <wikibugs>	 (03CR) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[09:52:22] <tgr_>	 jnuche: I don't know what to do about the remaining CI error. It seems related to the patch but I can't reproduce it locally. We can either accept the CI break, force it through and test it in production (the test is a structure test for Special:Longpages so that would be straightforward), or wait until someone more familiar with ProofreadPage and/or namespace handling shows up.
[09:54:34] <jnuche>	 tgr_: if we backport up to the mwdebug servers, could you run the test there before syncing out everywhere else?
[09:55:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.602s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:55:48] <tgr_>	 running PHPUnit tests in production sounds scary
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:39] <tgr_>	 probably wouldn't work anyway, no composer dev dependencies etc. If it did work, I would be afraid of it making live DB or cache changes somehow.
[09:56:49] <tgr_>	 If you mean test manually, sure I can do that
[09:57:01] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti5004.eqsin.wmnet
[09:57:08] <jnuche>	 tgr_: yeah, I meant the manual test for Special:Longpages you were proposing
[09:58:06] <tgr_>	 yeah I can do that
[09:58:19] <tgr_>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/1129771/comments/d26bba0b_8fbbfe91
[09:58:41] <Amir1>	 https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium/82271/consoleFull
[09:58:53] <tgr_>	 looks quite nasty, like some kind of loop when generating namespaces
[09:58:57] <Amir1>	 That db query can ... cause issues, let's say
[09:58:58] <jnuche>	 give me a min, I'm still trying to wrap my head around the CI errors
[09:59:08] <tgr_>	 but I reviewed the ProofreadPage code and it's definitely loop-free
[09:59:30] <tgr_>	 the tests pass locally, and NamespaceInfo is used in all kinds of places, not just those two special pages
[09:59:36] <tgr_>	 so no clue what's going on there
[09:59:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656273 (phaultfinder)
[09:59:40] <Amir1>	 maybe something is wrong with SpecialLongPages?
[10:00:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.253s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:00:58] <tgr_>	 maybe
[10:01:05] <tgr_>	 the code seems pretty normal: https://gerrit.wikimedia.org/g/mediawiki/core/+/c24c8735d78abf33a8ed475c88379ac7588ce213/includes/specials/SpecialShortPages.php#60
[10:01:31] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[10:01:50] <jnuche>	 ok, yeah, that doesn't look good. I'd feel better if we could have someone else take a look before merging
[10:04:09] <tgr_>	 reverse-engineering from that test error, it seems like NamespaceInfo::getContentNamespaces() repeats an infinite number of times
[10:04:34] <tgr_>	 ...repeats the ProofreadPage namespaces an infinite number of times
[10:04:39] <tgr_>	 but only in that one test
[10:07:01] <tgr_>	 there must be some sort of loop that causes the MediaWikiServices hook to be called an infinite number of times
[10:07:44] <tgr_>	 hm, do we isolate globals between tests?
[10:08:26] <tgr_>	 maybe it's just a matter of ProofreadPage manipulating globals directly, and then every time a ProofreadPage testcase runs, it adds more namespaces
[10:09:32] <moritzm>	 !log installing gunicorn security updates
[10:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:49] <dcausse>	 tgr_: yes possibly there's a $wgContentNamespaces[] = $wgProofreadPageNamespaceIds[$key]
[10:13:07] <wikibugs>	 (PS1) Muehlenhoff: klaxon: Drop conditional for buster [puppet] - https://gerrit.wikimedia.org/r/1129784
[10:13:24] <wikibugs>	 (PS2) Muehlenhoff: klaxon: Drop conditional for buster [puppet] - https://gerrit.wikimedia.org/r/1129784
[10:14:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656302 (phaultfinder)
[10:18:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Initial stub role for mariadb::research [puppet] - https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) (owner: Muehlenhoff)
[10:21:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Bitu: Add approval config for airflow-research-ops [puppet] - https://gerrit.wikimedia.org/r/1128357 (owner: Muehlenhoff)
[10:21:42] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2002.codfw.wmnet with OS bookworm
[10:22:25] <wikibugs>	 (CR) Vgutierrez: [C:-1] haproxy: using tmpfs directory for private tls material (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[10:24:13] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] klaxon: Drop conditional for buster [puppet] - https://gerrit.wikimedia.org/r/1129784 (owner: Muehlenhoff)
[10:24:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656342 (phaultfinder)
[10:25:09] <wikibugs>	 (PS2) Filippo Giunchedi: misc: report search-grafana-dashboards results details in markdown [software] - https://gerrit.wikimedia.org/r/1129242
[10:25:33] <wikibugs>	 ops-codfw, DC-Ops, Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472 (elukey) NEW
[10:26:03] <wikibugs>	 ops-codfw, DC-Ops, Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10656369 (elukey) The host is completely depooled, please take any action that you need to do :)
[10:26:56] <wikibugs>	 (PS1) Vgutierrez: acme_chief::cloud: Avoid leaking designate secrets [puppet] - https://gerrit.wikimedia.org/r/1129786
[10:28:26] <wikibugs>	 (CR) MVernon: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1129377 (https://phabricator.wikimedia.org/T389423) (owner: Eevans)
[10:31:13] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[10:33:04] <wikibugs>	 (CR) Muehlenhoff: [C:+2] klaxon: Drop conditional for buster [puppet] - https://gerrit.wikimedia.org/r/1129784 (owner: Muehlenhoff)
[10:34:13] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/1129786 (owner: Vgutierrez)
[10:34:29] <wikibugs>	 (CR) Vgutierrez: [C:+2] acme_chief::cloud: Avoid leaking designate secrets [puppet] - https://gerrit.wikimedia.org/r/1129786 (owner: Vgutierrez)
[10:38:41] <elukey>	 !log restart imposm.service on maps1009 - T389462
[10:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:45] <stashbot>	 T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462
[10:39:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656421 (phaultfinder)
[10:42:58] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm
[10:43:14] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[10:44:23] <moritzm>	 !log installing Java security updates on idp hosts
[10:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:43] <wikibugs>	 (PS6) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[10:44:43] <wikibugs>	 (PS2) Ayounsi: Add transit/peering in/out port saturation alert [alerts] - https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052)
[10:47:12] <wikibugs>	 (CR) Ayounsi: "Thanks!" [alerts] - https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: Ayounsi)
[10:51:03] <wikibugs>	 (PS2) Phuedx: ext-EventStreamConfig: Reduce product_metrics.web_base data collection [mediawiki-config] - https://gerrit.wikimedia.org/r/1129270
[10:55:36] <tgr_>	 jnuche: we are good to go, but out of time I guess?
[10:55:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'clear' for AS: 52999
[10:56:08] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 52999
[10:58:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs7003.magru.wmnet,lvs1013.eqiad.wmnet} and A:liberica
[10:58:17] <wikibugs>	 (PS17) Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227)
[10:58:31] <wikibugs>	 (CR) Fabfur: haproxy: using tmpfs directory for private tls material (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[10:59:04] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs7003.magru.wmnet,lvs1013.eqiad.wmnet} and A:liberica
[10:59:07] <jnuche>	 jouncebot: nowandnext
[10:59:07] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0900)
[10:59:07] <jouncebot>	 In 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1100)
[10:59:25] <jnuche>	 going to ask if we can squeeze in the backport
[10:59:38] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs[7001-7002].magru.wmnet} and A:liberica
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1100)
[11:00:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:00:57] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs[7001-7002].magru.wmnet} and A:liberica
[11:04:02] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[11:05:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:05:59] <wikibugs>	 (PS5) Slyngshede: Upgrade CAS to version 7.1.4 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1111636
[11:08:15] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[11:10:52] <wikibugs>	 (CR) Vgutierrez: [C:-1] "current CR breaks OCSP response stapling for certificates deployed by sslcert::certificate" [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[11:13:37] <wikibugs>	 (PS1) Muehlenhoff: Failover idp to idp1004 [dns] - https://gerrit.wikimedia.org/r/1129788
[11:14:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656580 (phaultfinder)
[11:15:52] <jnuche>	 tgr_: if you're still around, I think we can go ahead with backporting the fix
[11:17:49] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm
[11:18:00] <tgr_>	 yay!
[11:18:29] <tgr_>	 zuul on master seems hopelessly backlogged but the normal tests pass so I think that's good enough
[11:18:56] <wikibugs>	 (PS1) Gergő Tisza: Use MediaWikiServices for early config changes [extensions/ProofreadPage] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129789 (https://phabricator.wikimedia.org/T288819)
[11:19:14] <jnuche>	 looks like the gate jobs finally made it through
[11:19:32] <jnuche>	 (as in, started running)
[11:19:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656606 (phaultfinder)
[11:20:17] <tgr_>	 are you backporting, or should I?
[11:22:35] <jnuche>	 tgr_: can you do the honors? :)
[11:22:45] <jnuche>	 tgr_: wait
[11:23:08] <wikibugs>	 (PS1) Muehlenhoff: Add db1300 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1129790 (https://phabricator.wikimedia.org/T389089)
[11:23:36] <jnuche>	 tgr_: nvm, I thought SRE may have an issue with backporting now
[11:23:39] <jnuche>	 seems it's ok
[11:26:13] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add db1300 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1129790 (https://phabricator.wikimedia.org/T389089) (owner: Muehlenhoff)
[11:26:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129789 (https://phabricator.wikimedia.org/T288819) (owner: Gergő Tisza)
[11:31:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host db1300.eqiad.wmnet
[11:31:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:32:21] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[11:33:08] <wikibugs>	 (PS1) Brouberol: Query the wiki API through envoy when running in kubernetes [dumps] - https://gerrit.wikimedia.org/r/1129793 (https://phabricator.wikimedia.org/T388378)
[11:35:17] <wikibugs>	 (PS18) Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227)
[11:36:18] <wikibugs>	 (CR) Fabfur: haproxy: using tmpfs directory for private tls material (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[11:37:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db1300.eqiad.wmnet - jmm@cumin2002"
[11:37:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db1300.eqiad.wmnet - jmm@cumin2002"
[11:37:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:37:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache db1300.eqiad.wmnet on all recursors
[11:37:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db1300.eqiad.wmnet on all recursors
[11:37:38] <moritzm>	 !log installing debootstrap bugfix updates
[11:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:42] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[11:38:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db1300.eqiad.wmnet - jmm@cumin2002"
[11:38:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db1300.eqiad.wmnet - jmm@cumin2002"
[11:39:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74277 and previous config saved to /var/cache/conftool/dbconfig/20250320-113918-root.json
[11:40:27] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10656640 (MoritzMuehlenhoff)
[11:40:44] <wikibugs>	 (Merged) jenkins-bot: Use MediaWikiServices for early config changes [extensions/ProofreadPage] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129789 (https://phabricator.wikimedia.org/T288819) (owner: Gergő Tisza)
[11:41:16] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129789|Use MediaWikiServices for early config changes (T288819 T389430)]]
[11:41:20] <stashbot>	 T288819: NamespaceInfo service missing namespaces if initialized too early - https://phabricator.wikimedia.org/T288819
[11:41:20] <stashbot>	 T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430
[11:42:13] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[11:42:43] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM, one comment in line." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[11:42:55] <tgr_>	 dcausse: should I test something specific for the ProofreadPage patch, other than namespace info in the siteinfo API?
[11:44:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656659 (phaultfinder)
[11:45:10] <wikibugs>	 (PS1) Zoe: Re-enable creation of Flow pages for sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795
[11:45:38] <wikibugs>	 (CR) Btullis: [C:+1] Query the wiki API through envoy when running in kubernetes [dumps] - https://gerrit.wikimedia.org/r/1129793 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[11:46:51] <wikibugs>	 (CR) Cathal Mooney: "Nice!  Overall it lgtm if everyone is in agreement." [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[11:47:11] <wikibugs>	 (PS2) Zoe: Re-enable creation of Flow pages for sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911)
[11:47:40] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Add transit/peering in/out port saturation alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: Ayounsi)
[11:48:12] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1129789|Use MediaWikiServices for early config changes (T288819 T389430)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:48:17] <stashbot>	 T288819: NamespaceInfo service missing namespaces if initialized too early - https://phabricator.wikimedia.org/T288819
[11:48:17] <stashbot>	 T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430
[11:51:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host db1300.eqiad.wmnet with OS bookworm
[11:53:44] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[11:54:13] <tgr_>	 jnuche: I'm trying to find a wiki where I can test the fix; https://versions.toolforge.org/ says group 0 is on wmf.21, but https://test2.wikipedia.org/wiki/Special:Version says it's on wmf.20
[11:54:20] <tgr_>	 I guess the version tool is wrong?
[11:54:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74278 and previous config saved to /var/cache/conftool/dbconfig/20250320-115423-root.json
[11:55:14] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10656702 (Ladsgroup)
[11:55:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656703 (phaultfinder)
[11:56:01] <jnuche>	 tgr_: `test2` is actually group1, but `test` should have the fix if you can test there: https://test.wikipedia.org/wiki/Special:Version
[11:56:35] <tgr_>	 it doesn't use ProofreadPage though
[11:56:52] <tgr_>	 I guess not really testable then
[11:57:08] <tgr_>	 I can test during train rollout if that's OK
[11:57:22] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[11:57:23] <jnuche>	 jouncebot: nowandnext
[11:57:24] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1100)
[11:57:24] <jouncebot>	 In 0 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1200)
[11:57:34] <jnuche>	 tgr_: yeah, let's do that
[11:57:53] <jnuche>	 ok, I'm going to roll out the train to group2 in a few minutes
[11:57:55] <tgr_>	 or I guess closed wikisources would be group 0
[11:58:28] <jnuche>	 I see a couple of wikisource wikis in group0, yes
[11:58:35] <jnuche>	 you want me to hold on?
[11:59:04] <wikibugs>	 (CR) Brouberol: [C:+2] Query the wiki API through envoy when running in kubernetes [dumps] - https://gerrit.wikimedia.org/r/1129793 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[11:59:49] <tgr_>	 eh
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1200)
[12:00:28] <tgr_>	 I'm trying ht.wikisource which is definitely wmf.21, but even without the fix the siteinfo API says all the ProofreadPage namespaces are content
[12:00:42] <tgr_>	 so maybe the bug only occurs in a more specific situation?
[12:00:46] <tgr_>	 let me finish the backport
[12:01:07] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[12:01:10] <tgr_>	 or will the train scap take care of that anyway?
[12:02:21] <jnuche>	 tgr_: yeah, if you finish the backport you will get the fix in ht.wikisource 
[12:02:32] <jnuche>	 sorry, I didn't realize you had stopped at testservers sync
[12:02:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1300.eqiad.wmnet with reason: host reimage
[12:02:45] <jnuche>	 it should be safe to finish, you should go ahead
[12:03:20] <tgr_>	 I mean the fix is now on the test servers but absolutely no difference in behavior with or without, I can't reproduce the bug
[12:03:26] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[12:03:56] <tgr_>	 I also can't test Special:Longpages because in production that's a daily job (although I'm very confident that was just a cross-test pollution issue)
[12:04:32] <jnuche>	 tgr_: sry, missed that, doing a couple of things at the same time
[12:04:59] <jnuche>	 if the problem still persists, do you have an idea how bad the impact could be if we roll all the way to group2?
[12:05:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1300.eqiad.wmnet with reason: host reimage
[12:05:24] <jnuche>	 *in case the problem still persists after your fix
[12:05:50] <wikibugs>	 (PS1) Clément Goubert: modules.cache.mcrouter: Copy for new version [deployment-charts] - https://gerrit.wikimedia.org/r/1129802 (https://phabricator.wikimedia.org/T389480)
[12:05:57] <wikibugs>	 (PS1) Clément Goubert: modules.cache.mcrouter: Allow exporter port config [deployment-charts] - https://gerrit.wikimedia.org/r/1129803 (https://phabricator.wikimedia.org/T389480)
[12:06:12] <wikibugs>	 (PS1) Clément Goubert: mcrouter: Update cache.mcrouter to 1.3.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1129804 (https://phabricator.wikimedia.org/T389480)
[12:06:16] <tgr_>	 it would break search, no idea about timing (how much time until it affects the ES index / how much time until a revert affects the index)
[12:06:41] <wikibugs>	 (PS1) Aklapper: phabricator weekly changes email: List tasks "in progress" for >2y [puppet] - https://gerrit.wikimedia.org/r/1129806 (https://phabricator.wikimedia.org/T380300)
[12:07:12] <tgr_>	 FWIW we had the exact same issue with Wikibase a week ago and the same fix worked there
[12:08:31] <jnuche>	 EBernhardson added a few notes here on how they debugged the issue: https://phabricator.wikimedia.org/T389430#10654683
[12:08:56] <jnuche>	 would it be possible to do the same thing in mwdebug1002 right now and verify we get a trace similar to wmf.20?
[12:09:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74279 and previous config saved to /var/cache/conftool/dbconfig/20250320-120928-root.json
[12:10:51] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129789|Use MediaWikiServices for early config changes (T288819 T389430)]] (duration: 29m 34s)
[12:10:56] <stashbot>	 T288819: NamespaceInfo service missing namespaces if initialized too early - https://phabricator.wikimedia.org/T288819
[12:10:56] <stashbot>	 T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430
[12:11:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:11:24] <jnuche>	 similar to wmf.20, or whatever load trace is expected after the changes
[12:12:26] <tgr_>	 I do see the 250/252 namespaces
[12:13:54] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm
[12:14:14] <tgr_>	 but then I'm pretty sure those commands are identical to looking at the siteinfo API
[12:14:23] <jnuche>	 tgr_: that sounds promising, how about I roll out to group1 and then we check in one of the wikis with ProofreadPage there?
[12:14:41] <tgr_>	 so for some reason the bug doesn't seem reproducible on the group0 wikisources in the first place
[12:14:50] <tgr_>	 yeah, let's do that
[12:14:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10656759 (elukey) Host up and running with UEFI and Bookworm :)
[12:15:01] <jnuche>	 all aboard the train
[12:15:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Failover idp to idp1004 [dns] - https://gerrit.wikimedia.org/r/1129788 (owner: Muehlenhoff)
[12:15:29] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[12:15:33] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129809 (https://phabricator.wikimedia.org/T386216)
[12:15:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129809 (https://phabricator.wikimedia.org/T386216) (owner: TrainBranchBot)
[12:16:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:16:26] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129809 (https://phabricator.wikimedia.org/T386216) (owner: TrainBranchBot)
[12:17:38] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[12:18:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1300.eqiad.wmnet with OS bookworm
[12:21:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db1300.eqiad.wmnet
[12:23:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:24:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74280 and previous config saved to /var/cache/conftool/dbconfig/20250320-122433-root.json
[12:28:39] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.21  refs T386216
[12:28:43] <stashbot>	 T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216
[12:30:07] <jnuche>	 tgr_: we're at group1
[12:32:24] <tgr_>	 I re-did ebernhardson's tests on the same wiki he used and it looks correct (the namespace IDs include 100/102 for all three commands)
[12:33:55] <jnuche>	 🎉
[12:34:27] <jnuche>	 awesome, going to wait a couple of minutes and then I'll continue deploying to group2
[12:34:54] <jnuche>	 tgr_: thanks for the fix and following up on this
[12:36:23] <moritzm>	 !log installing openjdk 17 security updates on puppet servers (the necessary restarts may cause a few interrupted puppet runs and will be splayed out)
[12:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74281 and previous config saved to /var/cache/conftool/dbconfig/20250320-123939-root.json
[12:43:01] <wikibugs>	 (PS1) Effie Mouzeli: mw-mcrouter: use monitoring.named_ports [deployment-charts] - https://gerrit.wikimedia.org/r/1129816
[12:44:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656864 (phaultfinder)
[12:45:00] <wikibugs>	 (Abandoned) Kosta Harlan: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126980 (https://phabricator.wikimedia.org/T388125) (owner: Máté Szabó)
[12:45:27] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129817 (https://phabricator.wikimedia.org/T386216)
[12:45:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129817 (https://phabricator.wikimedia.org/T386216) (owner: TrainBranchBot)
[12:45:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:48:16] <moritzm>	 ^ the drmrs failures are transient, caused by the Java update on puppet servers
[12:49:56] <wikibugs>	 (CR) Jaime Nuche: [V:+2] group2 to 1.44.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1129817 (https://phabricator.wikimedia.org/T386216) (owner: TrainBranchBot)
[12:50:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/main (k8s) 1.378s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:51:18] <wikibugs>	 (03CR) 10Effie Mouzeli: "I think setting monitoring.named_ports:true will tidy things up" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129803 (https://phabricator.wikimedia.org/T389480) (owner: 10Clément Goubert)
[12:53:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:53:52] <wikibugs>	 (03PS1) 10Sergio Gimeno: analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622)
[12:54:20] <wikibugs>	 (03PS1) 10Sergio Gimeno: feat(SurfacingStructuredTasks): increase max edit cap to 100 [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622)
[12:54:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:55:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[12:55:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno)
[12:55:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/main (k8s) 1.378s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:55:24] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet
[12:55:26] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno)
[12:56:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2002.codfw.wmnet to drbd
[13:00:07] <wikibugs>	 (03PS1) 10Gergő Tisza: Enable SUL3 login for 10% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153)
[13:01:13] <tgr_>	 jouncebot: now
[13:01:13] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1300)
[13:02:12] <tgr_>	 looks like the bot stopped announcing windows
[13:02:18] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.21  refs T386216
[13:02:22] <stashbot>	 T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216
[13:02:50] <sergi0>	 yep, maybe dst confusion?
[13:02:53] <sergi0>	 o/
[13:03:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of testvm2002.codfw.wmnet to drbd
[13:03:38] <sergi0>	 I'm gonna self-deploy my changes
[13:03:50] <jnuche>	 please hold
[13:04:05] <jnuche>	 train just finished deploying, I need to check logs
[13:04:13] <jnuche>	 sergi0: ^
[13:04:26] <sergi0>	 ack
[13:04:58] <wikibugs>	 (03PS2) 10Filippo Giunchedi: logstash: read k8s-mw topics as needed [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335)
[13:05:57] <wikibugs>	 (03PS1) 10Effie Mouzeli: thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480)
[13:05:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] mcrouter: Update cache.mcrouter to 1.3.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129804 (https://phabricator.wikimedia.org/T389480) (owner: 10Clément Goubert)
[13:06:37] <tgr_>	 I might add a config patch in a while
[13:10:09] <jnuche>	 sergi0: thanks for waiting, you can go ahead with backports
[13:10:22] <sergi0>	 great, ty!
[13:10:34] <wikibugs>	 (03PS2) 10Effie Mouzeli: thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480)
[13:10:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno)
[13:10:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno)
[13:11:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[13:11:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet
[13:12:59] <wikibugs>	 (03Merged) 10jenkins-bot: analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno)
[13:14:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[13:17:50] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: common.yaml: remove firewall rules for kafka-main100[1-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100807 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[13:19:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657024 (10phaultfinder)
[13:20:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:25:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657055 (10phaultfinder)
[13:27:54] <wikibugs>	 (03Merged) 10jenkins-bot: feat(SurfacingStructuredTasks): increase max edit cap to 100 [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno)
[13:28:14] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1129819|analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data (T388622)]], [[gerrit:1129820|feat(SurfacingStructuredTasks): increase max edit cap to 100 (T388622)]]
[13:28:17] <stashbot>	 T388622: Increase target audience for Surfacing Structured Task Experiment - https://phabricator.wikimedia.org/T388622
[13:29:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet
[13:30:36] <moritzm>	 !log remove ganeti-test2001 for reimage T382515
[13:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:40] <stashbot>	 T382515: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515
[13:31:06] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1129819|analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data (T388622)]], [[gerrit:1129820|feat(SurfacingStructuredTasks): increase max edit cap to 100 (T388622)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:32:01] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[13:34:28] <wikibugs>	 (03PS1) 10Slyngshede: P:mirrors add file age exporter [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694)
[13:35:07] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129828
[13:35:23] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129829
[13:35:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[13:36:33] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129830
[13:36:43] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129831
[13:38:31] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5118/co" [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[13:39:14] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129819|analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data (T388622)]], [[gerrit:1129820|feat(SurfacingStructuredTasks): increase max edit cap to 100 (T388622)]] (duration: 11m 00s)
[13:39:18] <stashbot>	 T388622: Increase target audience for Surfacing Structured Task Experiment - https://phabricator.wikimedia.org/T388622
[13:40:30] <sergi0>	 I'm done with my changes, tgr_ you want to take yours or I can do it if you want
[13:41:19] <tgr_>	 sergi0: sure, thanks
[13:42:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[13:42:29] <tgr_>	 it's not really testable
[13:44:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 (owner: 10Muehlenhoff)
[13:44:08] <sergi0>	 ok, just curious, what signal do you normally look at after a SUL3 rollout?
[13:44:24] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SUL3 login for 10% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[13:44:43] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1129821|Enable SUL3 login for 10% of group 1 users (T384153)]]
[13:44:47] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[13:45:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti-test2001 to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1129832 (https://phabricator.wikimedia.org/T382515)
[13:47:47] <logmsgbot>	 !log sgimeno@deploy2002 tgr, sgimeno: Backport for [[gerrit:1129821|Enable SUL3 login for 10% of group 1 users (T384153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:48:03] <sergi0>	 tgr_: should I proceed with sync then?
[13:48:13] <tgr_>	 yes, thanks
[13:48:19] <logmsgbot>	 !log sgimeno@deploy2002 tgr, sgimeno: Continuing with sync
[13:48:32] <tgr_>	 I'll look at error logs in a few hours
[13:48:49] <sergi0>	 👍
[13:49:18] <tgr_>	 plus we have some statsd charts about authentication action frequencies and error rates
[13:49:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657160 (10phaultfinder)
[13:50:03] <tgr_>	 admittedly not terribly useful because there are so many weird scrapers which almost but not quite simulate human browsing behavior, it's mostly noise
[13:50:13] <wikibugs>	 (03PS2) 10Slyngshede: P:mirrors add file age exporter [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694)
[13:50:24] <tgr_>	 so in practice it's mostly just error logs and human error reports
[13:51:05] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5119/co" [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[13:51:47] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "please add an additional check that ensure that no cert is being configured to use the on-disk paths if the volatile TLS storage is enable" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[13:52:35] <sergi0>	 gotcha, thanks for explaining
[13:53:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:55:12] <wikibugs>	 (CR) Arturo Borrero Gonzalez: Add new profile (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:56:02] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129821|Enable SUL3 login for 10% of group 1 users (T384153)]] (duration: 11m 18s)
[13:56:06] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[13:56:17] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:56:50] <sergi0>	 tgr_: your change is live
[13:57:22] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "(Post merge +1, for completeness, and as per Slack conversations.)" [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis)
[13:57:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:57:44] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:58:11] <tgr_>	 thanks!
[13:58:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:59:59] <wikibugs>	 (03PS3) 10DCausse: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821)
[13:59:59] <wikibugs>	 (03PS3) 10DCausse: cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821)
[13:59:59] <wikibugs>	 (03PS3) 10DCausse: cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821)
[14:00:02] <wikibugs>	 SRE, observability, Observability-Logging, Wikimedia-Logstash: pybal logs into logstash - https://phabricator.wikimedia.org/T223924#10657173 (fgiunchedi) Open→Declined pybal is being replaced by liberica
[14:00:04] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10657175 (10Jhancock.wm)
[14:01:00] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:01:00] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[14:01:01] <jouncebot>	 In 0 hour(s) and 58 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500)
[14:01:38] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10657185 (Jhancock.wm) absolutely agree after all the work I see y'all doing. I've pulled a random disk and reinserted. l...
[14:01:46] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10657186 (Jhancock.wm) a:Jhancock.wm
[14:02:42] <wikibugs>	 (03PS1) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187)
[14:03:12] <wikibugs>	 (03PS1) 10Dreamy Jazz: GlobalContributionsPagerTest: De-duplicate getting new pager [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838
[14:03:51] <wikibugs>	 (03PS2) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187)
[14:04:03] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] GlobalContributionsPagerTest: De-duplicate getting new pager [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 (owner: 10Dreamy Jazz)
[14:04:06] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz)
[14:06:42] <wikibugs>	 (03PS1) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187)
[14:08:15] <wikibugs>	 (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse)
[14:08:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:08:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: upgrade search plugins - bking@cumin2002 - T389119
[14:08:43] <stashbot>	 T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[14:08:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:09:52] <Dreamy_Jazz>	 Going to deploy some wmf backports
[14:10:09] <wikibugs>	 (03PS2) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187)
[14:10:16] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse)
[14:10:29] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz)
[14:11:00] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: upgrade search plugins - bking@cumin2002 - T389119
[14:11:33] <wikibugs>	 SRE-grizzly-sprint, Observability-Metrics: Grizzly: upgrade to 0.2 - https://phabricator.wikimedia.org/T332892#10657217 (fgiunchedi) Open→Invalid We have replaced Grizzly with Pyrra
[14:12:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119
[14:12:20] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:12:37] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:13:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.281s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:13:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz)
[14:13:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 (owner: 10Dreamy Jazz)
[14:13:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz)
[14:21:44] <wikibugs>	 (03PS1) 10Gergő Tisza: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433)
[14:22:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza)
[14:23:14] <wikibugs>	 (03PS1) 10Vgutierrez: sre: Add LibericaStaleConfig alert [alerts] - 10https://gerrit.wikimedia.org/r/1129846 (https://phabricator.wikimedia.org/T389175)
[14:23:25] <wikibugs>	 (03Merged) 10jenkins-bot: GlobalContributionsPagerTest: De-duplicate getting new pager [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 (owner: 10Dreamy Jazz)
[14:23:26] <wikibugs>	 (03Merged) 10jenkins-bot: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz)
[14:24:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657242 (10phaultfinder)
[14:27:53] <wikibugs>	 (03CR) 10Bking: "Per Slack conversation with @aotto@wikimedia.org, DPE should not be affected. CCing our Search Platform SWEs for review" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi)
[14:32:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:33:39] <wikibugs>	 (03CR) 10Ahmon Dancy: "Confirmed.  In T383947 new groups "spiderpig-users" and "spiderpig-admins" are proposed (although the latter is probably not needed)." [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff)
[14:33:42] <wikibugs>	 (03Merged) 10jenkins-bot: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz)
[14:34:02] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1129837|GlobalContributions: Do not look up permissions for registered target (T389187)]], [[gerrit:1129838|GlobalContributionsPagerTest: De-duplicate getting new pager]], [[gerrit:1129839|GlobalContributions: Do not look up permissions for registered target (T389187)]]
[14:34:06] <stashbot>	 T389187: GlobalContributions: Make displaying deleted revisions optional - https://phabricator.wikimedia.org/T389187
[14:35:06] <wikibugs>	 (03CR) 10Bking: "Upon further review, Search Platform SWEs do not believe we are affected by this change. Feel free to move forward." [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi)
[14:35:54] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119
[14:35:57] <stashbot>	 T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[14:36:27] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:38:50] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1129837|GlobalContributions: Do not look up permissions for registered target (T389187)]], [[gerrit:1129838|GlobalContributionsPagerTest: De-duplicate getting new pager]], [[gerrit:1129839|GlobalContributions: Do not look up permissions for registered target (T389187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:40:08] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks for the runbook link!" [alerts] - 10https://gerrit.wikimedia.org/r/1129846 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez)
[14:41:29] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[14:42:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:42:47] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM! Thank you for sharing some more of the computing resources :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[14:43:59] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM! Thank you." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[14:44:53] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-main: increase the scheduler resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[14:44:56] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[14:46:23] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-main: increase the scheduler resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[14:46:27] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[14:48:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[14:49:07] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129837|GlobalContributions: Do not look up permissions for registered target (T389187)]], [[gerrit:1129838|GlobalContributionsPagerTest: De-duplicate getting new pager]], [[gerrit:1129839|GlobalContributions: Do not look up permissions for registered target (T389187)]] (duration: 15m 04s)
[14:49:11] <stashbot>	 T389187: GlobalContributions: Make displaying deleted revisions optional - https://phabricator.wikimedia.org/T389187
[14:49:44] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:49:56] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:50:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:50:08] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:51:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[14:51:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[14:52:24] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] sre: Add LibericaStaleConfig alert [alerts] - 10https://gerrit.wikimedia.org/r/1129846 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez)
[14:52:29] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:52:30] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:52:32] <Amir1>	 jouncebot: nowandnext
[14:52:33] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 7 minute(s)
[14:52:33] <jouncebot>	 In 0 hour(s) and 7 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500)
[14:52:44] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:52:57] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:53:19] <wikibugs>	 (PS1) Brouberol: Fix typo in values [deployment-charts] - https://gerrit.wikimedia.org/r/1129850 (https://phabricator.wikimedia.org/T386282)
[14:54:22] <wikibugs>	 (CR) Filippo Giunchedi: "Can't vote with confidence, sorry!" [deployment-charts] - https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: Effie Mouzeli)
[14:54:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657316 (phaultfinder)
[14:57:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: try rolling operation without allow-yellow flag - bking@cumin2002 - T389119
[14:57:25] <stashbot>	 T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[14:57:50] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:58:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:58:40] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided)
[14:59:12] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 00m 33s)
[14:59:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:59:53] <wikibugs>	 (CR) Brouberol: [C:+2] Fix typo in values [deployment-charts] - https://gerrit.wikimedia.org/r/1129850 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[15:00:05] <jouncebot>	 jnuche and jeena: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500)
[15:00:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[15:00:38] <wikibugs>	 (CR) Elukey: [C:+1] Switch ganeti-test2001 to EFI [puppet] - https://gerrit.wikimedia.org/r/1129832 (https://phabricator.wikimedia.org/T382515) (owner: Muehlenhoff)
[15:01:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[15:01:42] <wikibugs>	 (PS6) BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387)
[15:01:52] <wikibugs>	 (CR) BCornwall: cdn: Add roll-upgrade-varnish (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: BCornwall)
[15:02:42] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided)
[15:03:28] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 00m 51s)
[15:03:29] <sukhe>	 not sure who added storcli but note: https://puppetboard.wikimedia.org/failures
[15:03:40] <sukhe>	 E: Problem with MergeList /var/lib/apt/lists/apt.wikimedia.org_wikimedia_dists_bookworm-wikimedia_thirdparty_hwraid_binary-amd64_Packages
[15:03:43] <sukhe>	 E: The package lists or status file could not be parsed or opened. 
[15:03:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10657360 (Jhancock.wm) okay since this has happened before i pulled DIMM_B1 to see if it would boot without it. Got the same error on DIMM_B2. moved it to DIMM_B1. error move...
[15:03:49] <sukhe>	 this is causing a widespread puppet failure
[15:04:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10657361 (Jhancock.wm) a:Jhancock.wm
[15:04:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657363 (phaultfinder)
[15:04:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:06:26] <moritzm>	 ^ should recover soon. I rolled back, this was caused by https://phabricator.wikimedia.org/T388628#10657364
[15:06:28] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:06:36] <sukhe>	 ah thank you
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:46] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:11:31] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:11:32] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:11:39] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:11:49] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:14:42] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: try rolling operation without allow-yellow flag - bking@cumin2002 - T389119
[15:14:47] <stashbot>	 T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[15:19:10] <wikibugs>	 (PS1) BCornwall: upgrade cp3067 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129854 (https://phabricator.wikimedia.org/T378737)
[15:19:11] <wikibugs>	 (PS1) BCornwall: upgrade cp3068 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129855 (https://phabricator.wikimedia.org/T378737)
[15:19:13] <wikibugs>	 (PS1) BCornwall: upgrade cp3069 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129856 (https://phabricator.wikimedia.org/T378737)
[15:19:14] <wikibugs>	 (PS1) BCornwall: upgrade cp3070 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129857 (https://phabricator.wikimedia.org/T378737)
[15:19:16] <wikibugs>	 (PS1) BCornwall: upgrade cp3071 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129858 (https://phabricator.wikimedia.org/T378737)
[15:19:17] <wikibugs>	 (PS1) BCornwall: upgrade cp3072 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129859 (https://phabricator.wikimedia.org/T378737)
[15:19:19] <wikibugs>	 (PS1) BCornwall: upgrade cp3073 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129860 (https://phabricator.wikimedia.org/T378737)
[15:19:23] <wikibugs>	 (PS1) BCornwall: upgrade cp3075 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129861 (https://phabricator.wikimedia.org/T378737)
[15:19:27] <wikibugs>	 (PS1) BCornwall: upgrade cp3076 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129862 (https://phabricator.wikimedia.org/T378737)
[15:19:31] <wikibugs>	 (PS1) BCornwall: upgrade cp3077 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129863 (https://phabricator.wikimedia.org/T378737)
[15:19:35] <wikibugs>	 (PS1) BCornwall: upgrade cp3078 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129864 (https://phabricator.wikimedia.org/T378737)
[15:19:39] <wikibugs>	 (PS1) BCornwall: upgrade cp3079 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129865 (https://phabricator.wikimedia.org/T378737)
[15:19:43] <wikibugs>	 (PS1) BCornwall: upgrade cp3080 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129866 (https://phabricator.wikimedia.org/T378737)
[15:19:47] <wikibugs>	 (PS1) BCornwall: upgrade cp3081 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129867 (https://phabricator.wikimedia.org/T378737)
[15:20:12] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudelastic[1007,1009-1012].eqiad.wmnet with reason: troubleshooting red status
[15:20:52] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti-test2001 to EFI [puppet] - https://gerrit.wikimedia.org/r/1129832 (https://phabricator.wikimedia.org/T382515) (owner: Muehlenhoff)
[15:21:38] <wikibugs>	 (PS1) Andrew Bogott: typos: add 'dalmation' [puppet] - https://gerrit.wikimedia.org/r/1129868
[15:25:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657504 (phaultfinder)
[15:27:18] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3067 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129854 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:27:20] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3068 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129855 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:27:27] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3069 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129856 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:27:36] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3070 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129857 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:27:40] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3071 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129858 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:27:45] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3072 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129859 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:27:53] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3073 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129860 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:01] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3075 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129861 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:10] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3076 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129862 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:12] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3077 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129863 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:22] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3078 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129864 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:25] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3079 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129865 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:30] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3080 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129866 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:28:43] <wikibugs>	 (CR) Ssingh: [C:+1] upgrade cp3081 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129867 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:29:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:29:39] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:32:21] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[15:32:28] <claime>	 jouncebot: nowandnext
[15:32:28] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500)
[15:32:28] <jouncebot>	 In 0 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1600)
[15:34:45] <jinxer-wm>	 RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:36:08] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Build mediawiki-cli image - T389484
[15:36:14] <stashbot>	 T389484: Create a mediawiki-cli image - https://phabricator.wikimedia.org/T389484
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:25] <wikibugs>	 (CR) Eevans: [C:+2] restbase: commission restbase1043 (refresh for restbase1028) [puppet] - https://gerrit.wikimedia.org/r/1129377 (https://phabricator.wikimedia.org/T389423) (owner: Eevans)
[15:38:46] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3067 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129854 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:39:25] <wikibugs>	 (CR) Ssingh: [C:+1] cdn: Add roll-upgrade-varnish [cookbooks] - https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: BCornwall)
[15:39:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:42:27] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap sync-world: Build mediawiki-cli image - T389484 (duration: 06m 18s)
[15:42:31] <stashbot>	 T389484: Create a mediawiki-cli image - https://phabricator.wikimedia.org/T389484
[15:45:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657573 (phaultfinder)
[15:47:35] <moritzm>	 !log installing node-postcss security updates
[15:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:25] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2248.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:48:38] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2248.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:49:24] <wikibugs>	 (CR) Pppery: "This doesn't seem like the correct analysis of the cause - the maintenance script runs as "flow talk page manager", which should already h" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: Zoe)
[15:52:27] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:52:56] <wikibugs>	 (PS1) Bking: relforge: move relforge1003 into OpenSearch role [puppet] - https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752)
[15:54:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:54:31] <wikibugs>	 (PS2) Bking: relforge: move relforge1003 into OpenSearch role [puppet] - https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752)
[15:55:55] <wikibugs>	 (PS1) Muehlenhoff: testreduce: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529)
[15:56:43] <wikibugs>	 (CR) BCornwall: [C:+2] cdn: Add roll-upgrade-varnish [cookbooks] - https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: BCornwall)
[15:57:37] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752) (owner: Bking)
[15:57:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[15:58:46] <wikibugs>	 (CR) Clément Goubert: [C:+1] thumbor: use monitoring.named_ports [deployment-charts] - https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: Effie Mouzeli)
[15:58:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2250 to codfw - jhancock@cumin2002"
[15:58:50] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw-mcrouter: use monitoring.named_ports [deployment-charts] - https://gerrit.wikimedia.org/r/1129816 (owner: Effie Mouzeli)
[16:00:05] <jouncebot>	 jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1600).
[16:00:05] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:13] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2250 to codfw - jhancock@cumin2002"
[16:00:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:00:23] <wikibugs>	 (CR) DLynch: "The errors we got from running the script were clearly saying that the flow-create-board permission was missing, though. It could certainl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: Zoe)
[16:01:07] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[16:01:48] <tgr_>	 o/
[16:02:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2250
[16:02:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2250
[16:02:14] <godog>	 sorry i accidentally grafana, one sec
[16:03:02] <godog>	 back
[16:03:52] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw-on-k8s: align alerts with "pools" of capacity [alerts] - https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: Scott French)
[16:06:14] <wikibugs>	 (CR) Vgutierrez: cdn: Add roll-upgrade-varnish (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: BCornwall)
[16:07:12] <elukey>	 !log stop imposm on maps1009 to allow fixing the postgres db - T389462
[16:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:16] <stashbot>	 T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462
[16:07:27] <wikibugs>	 (CR) Clément Goubert: [C:+1] testreduce: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[16:08:22] <wikibugs>	 (CR) Filippo Giunchedi: "reverse-proxying https with mod_proxy is possible, a change similar to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125990 is nee" [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[16:09:09] <claime>	 godog: the whole grafana?
[16:09:14] <logmsgbot>	 !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1043.eqiad.wmnet with reason: Bootstrapping — T389423
[16:09:18] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[16:09:50] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.142.0" for 193 host(s)
[16:10:20] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] thumbor: use monitoring.named_ports [deployment-charts] - https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: Effie Mouzeli)
[16:10:22] <godog>	 claime: the whole apache to be exact
[16:10:25] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:10:27] <godog>	 rookie mistake
[16:10:31] <claime>	 damn
[16:10:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657721 (phaultfinder)
[16:11:12] <godog>	 ikr?
[16:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: imposm.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:11:39] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] mw-on-k8s: align alerts with "pools" of capacity [alerts] - https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: Scott French)
[16:13:41] <urandom>	 !log bootstrapping restbase1034-a/cassandra — T389423
[16:13:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:21] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.142.0" completed for 193 hosts
[16:14:39] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:18:04] <wikibugs>	 (PS1) Brouberol: Use abspaths when sub-processing dumps commands [dumps] - https://gerrit.wikimedia.org/r/1129880 (https://phabricator.wikimedia.org/T388378)
[16:18:56] <elukey>	 !log `ALTER TABLE public.wikidata_relation_members ALTER COLUMN id TYPE bigint;` on maps1009's postgres - T389462
[16:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:00] <stashbot>	 T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462
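Note on the `ALTER TABLE` above: the log does not say why the `id` column had to become `bigint`. Assuming the cause was ids outgrowing PostgreSQL's 32-bit `integer` range (an assumption, not something stated in the channel), the range check can be sketched in Python as below. The helper name and the sample ids are hypothetical; only the type ranges come from PostgreSQL's documentation.

```python
# PostgreSQL's documented ranges for its signed integer types.
INT4_MIN, INT4_MAX = -2**31, 2**31 - 1   # "integer" (4 bytes)
INT8_MIN, INT8_MAX = -2**63, 2**63 - 1   # "bigint" (8 bytes)


def smallest_pg_int_type(value: int) -> str:
    """Return the narrower of integer/bigint that can store `value`.

    Hypothetical helper for illustration; not part of imposm or PostgreSQL.
    """
    if INT4_MIN <= value <= INT4_MAX:
        return "integer"
    if INT8_MIN <= value <= INT8_MAX:
        return "bigint"
    raise OverflowError(f"{value} does not fit in bigint")


# Hypothetical ids, chosen only to sit on either side of the int4 limit.
for sample_id in (2_147_483_647, 2_147_483_648):
    print(sample_id, "->", smallest_pg_int_type(sample_id))
```

Under that assumption, any id above 2147483647 would be rejected by an `integer` column, which would be consistent with widening the column and then fixing the type in the imposm mapping.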
[16:20:59] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:21:54] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:22:21] <wikibugs>	 (CR) Btullis: [C:+1] "Thanks" [dumps] - https://gerrit.wikimedia.org/r/1129880 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[16:26:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:55] <wikibugs>	 (Merged) jenkins-bot: thumbor: use monitoring.named_ports [deployment-charts] - https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: Effie Mouzeli)
[16:27:07] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:30:12] <wikibugs>	 (PS1) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882
[16:31:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:32:46] <wikibugs>	 (PS1) BCornwall: cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - https://gerrit.wikimedia.org/r/1129884
[16:34:34] <wikibugs>	 (CR) Vgutierrez: [C:+1] cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - https://gerrit.wikimedia.org/r/1129884 (owner: BCornwall)
[16:36:36] <wikibugs>	 (CR) BCornwall: [C:+2] cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - https://gerrit.wikimedia.org/r/1129884 (owner: BCornwall)
[16:36:41] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - https://gerrit.wikimedia.org/r/1129884 (owner: BCornwall)
[16:37:43] <wikibugs>	 (Abandoned) BCornwall: sre.cdn.roll-upgrade-haproxy: migrate to SRELBBatchRunnerBaseCDN [cookbooks] - https://gerrit.wikimedia.org/r/925681 (owner: Jbond)
[16:40:31] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] Use abspaths when sub-processing dumps commands [dumps] - https://gerrit.wikimedia.org/r/1129880 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[16:41:03] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3067.esams.wmnet} and A:cp
[16:41:09] <logmsgbot>	 !log brouberol@deploy2002 Started scap build-images: (no justification provided)
[16:41:39] <logmsgbot>	 !log brouberol@deploy2002 Finished scap build-images: (no justification provided) (duration: 00m 30s)
[16:41:54] <wikibugs>	 (PS1) Elukey: maps: fix id type for the table wikidata_relation_members in imposm_mapping [puppet] - https://gerrit.wikimedia.org/r/1129886 (https://phabricator.wikimedia.org/T389462)
[16:41:58] <brett>	 !log Upgrading varnish to 7.1 on cp3067 (T378737)
[16:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:02] <stashbot>	 T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737
[16:42:02] <brett>	 since I forgot a --reason :/
[16:42:11] <wikibugs>	 (PS1) Jgiannelos: imposm: Change mapping to use bigint for column `id` [puppet] - https://gerrit.wikimedia.org/r/1129888 (https://phabricator.wikimedia.org/T389462)
[16:42:26] <wikibugs>	 (CR) Jgiannelos: [C:+1] maps: fix id type for the table wikidata_relation_members in imposm_mapping [puppet] - https://gerrit.wikimedia.org/r/1129886 (https://phabricator.wikimedia.org/T389462) (owner: Elukey)
[16:42:57] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[16:43:43] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[16:44:30] <wikibugs>	 (PS1) DCausse: cirrus: update alerts based on rc0 topics [alerts] - https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821)
[16:45:12] <wikibugs>	 (Abandoned) Jgiannelos: imposm: Change mapping to use bigint for column `id` [puppet] - https://gerrit.wikimedia.org/r/1129888 (https://phabricator.wikimedia.org/T389462) (owner: Jgiannelos)
[16:45:35] <wikibugs>	 (CR) Elukey: [C:+2] maps: fix id type for the table wikidata_relation_members in imposm_mapping [puppet] - https://gerrit.wikimedia.org/r/1129886 (https://phabricator.wikimedia.org/T389462) (owner: Elukey)
[16:45:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657869 (phaultfinder)
[16:45:39] <wikibugs>	 (CR) DCausse: "needs to be merged right after I05b8375" [alerts] - https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[16:45:48] <wikibugs>	 (CR) DCausse: [C:-2] cirrus: update alerts based on rc0 topics [alerts] - https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[16:46:28] <fabfur>	 !log imported haproxykafka 0.3.6 into apt repository (added TimestampType)  (T388397) 
[16:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:32] <stashbot>	 T388397: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397
[16:48:23] <fabfur>	 !log upgrade haproxykafka to 0.3.6 on A:cp (gradual rollout) 
[16:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:29] <fabfur>	 joal ^^
[16:49:01] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3067.esams.wmnet} and A:cp
[16:52:22] <wikibugs>	 (PS1) Effie Mouzeli: Revert "thumbor: use monitoring.named_ports" [deployment-charts] - https://gerrit.wikimedia.org/r/1129890
[16:52:47] <logmsgbot>	 !log brouberol@deploy2002 Started scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix - T388378
[16:52:53] <stashbot>	 T388378: Orchestrate dumps v1 from an airflow instance - https://phabricator.wikimedia.org/T388378
[16:53:12] <logmsgbot>	 !log brouberol@deploy2002 Finished scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix - T388378 (duration: 00m 24s)
[16:53:26] <wikibugs>	 (PS1) BCornwall: sre.cdn.roll-upgrade-varnish: Fix package parsing [cookbooks] - https://gerrit.wikimedia.org/r/1129891
[16:53:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:54:42] <wikibugs>	 (CR) CI reject: [V:-1] cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[16:54:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:55:18] <elukey>	 this is under maintenance --^ but it should be silenced
[16:55:52] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] Revert "thumbor: use monitoring.named_ports" [deployment-charts] - https://gerrit.wikimedia.org/r/1129890 (owner: Effie Mouzeli)
[16:56:02] <elukey>	 ah no right I wasn't able via cookbook since the host was wiped by reimage
[16:56:07] <elukey>	 lemme try to add something manually
[16:57:43] <wikibugs>	 (Merged) jenkins-bot: Revert "thumbor: use monitoring.named_ports" [deployment-charts] - https://gerrit.wikimedia.org/r/1129890 (owner: Effie Mouzeli)
[16:58:14] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3068.esams.wmnet} and A:cp
[16:59:51] <logmsgbot>	 !log brouberol@deploy2002 Started scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix (w/o cache) - T388378
[16:59:55] <stashbot>	 T388378: Orchestrate dumps v1 from an airflow instance - https://phabricator.wikimedia.org/T388378
[17:00:05] <jouncebot>	 bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700). nyaa~
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700)
[17:00:29] <wikibugs>	 (CR) BCornwall: [C:+2] cdn: Add roll-upgrade-varnish (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: BCornwall)
[17:00:47] <logmsgbot>	 !log brouberol@deploy2002 Finished scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix (w/o cache) - T388378 (duration: 00m 56s)
[17:01:05] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: Filippo Giunchedi)
[17:02:30] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3068 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129855 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[17:04:59] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi)
[17:05:21] <wikibugs>	 (CR) Bking: [C:+2] "self-merging, as this does not affect production hosts" [puppet] - https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752) (owner: Bking)
[17:05:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657999 (phaultfinder)
[17:06:02] <wikibugs>	 (PS1) Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317)
[17:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: imposm.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:07:52] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1129128 (https://phabricator.wikimedia.org/T389072) (owner: Filippo Giunchedi)
[17:08:25] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10658021 (MatthewVernon) That was (suspiciously) easy to re-add, but I notice there's no `megacli` available on this system,...
[17:08:55] <wikibugs>	 (PS2) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882
[17:13:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:17:11] <wikibugs>	 (PS3) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882
[17:18:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:19:54] <wikibugs>	 (PS1) Reedy: Sanitizer::normalizeWhitespace: simplify redundant preg_replace [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129895 (https://phabricator.wikimedia.org/T388733)
[17:22:07] <Reedy>	 jouncebot: nowandnext
[17:22:07] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700)
[17:22:07] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700)
[17:22:07] <jouncebot>	 In 0 hour(s) and 37 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800)
[17:22:24] <wikibugs>	 (CR) Andrew Bogott: [C:+2] typos: add 'dalmation' [puppet] - https://gerrit.wikimedia.org/r/1129868 (owner: Andrew Bogott)
[17:22:33] <wikibugs>	 (CR) Reedy: [C:+2] Sanitizer::normalizeWhitespace: simplify redundant preg_replace [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129895 (https://phabricator.wikimedia.org/T388733) (owner: Reedy)
[17:23:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host prior to reimage - bking@cumin2002 - T380752
[17:23:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host prior to reimage - bking@cumin2002 - T380752
[17:23:05] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[17:26:37] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3068.esams.wmnet} and A:cp
[17:26:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658127 (phaultfinder)
[17:27:02] <wikibugs>	 (CR) CI reject: [V:-1] cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[17:28:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:29:49] <wikibugs>	 (CR) BCornwall: [C:-1] "The code might be good but I think we could give some more background/meaning behind the commit in the message." [puppet] - https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza)
[17:30:02] <wikibugs>	 (CR) BCornwall: [C:-1] "Marking unresolved." [puppet] - https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza)
[17:33:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:34:42] <wikibugs>	 (PS19) Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227)
[17:36:37] <wikibugs>	 (Merged) jenkins-bot: Sanitizer::normalizeWhitespace: simplify redundant preg_replace [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129895 (https://phabricator.wikimedia.org/T388733) (owner: Reedy)
[17:37:57] <wikibugs>	 (CR) Ssingh: [C:+1] "I am curious, what was broken here? +1 if it works but still curious." [cookbooks] - https://gerrit.wikimedia.org/r/1129891 (owner: BCornwall)
[17:38:10] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10658230 (VRiley-WMF) @MatthewVernon Thanks for the heads up. This disk has been replaced using one of those spares! Still awaiting on Dell to send out the replacement. However, ma...
[17:38:22] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10658231 (VRiley-WMF) Open→Resolved
[17:38:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:39:01] <wikibugs>	 (CR) Ssingh: [C:+1] "Ahhhh ok nvm, I see it now." [cookbooks] - https://gerrit.wikimedia.org/r/1129891 (owner: BCornwall)
[17:39:56] <wikibugs>	 (CR) Scott French: "Thank you both for the review!" [alerts] - https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: Scott French)
[17:40:17] <wikibugs>	 (CR) Scott French: [C:+2] mw-on-k8s: align alerts with "pools" of capacity [alerts] - https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: Scott French)
[17:40:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:44] <wikibugs>	 (Merged) jenkins-bot: mw-on-k8s: align alerts with "pools" of capacity [alerts] - https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: Scott French)
[17:44:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658330 (phaultfinder)
[17:44:41] <wikibugs>	 (CR) Dreamy Jazz: Graph: Use new placeholder i18n from WikimediaMessages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[17:45:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:47:11] <wikibugs>	 (CR) Ssingh: "I think this is a good idea and much cleaner. The only question I have is if you know why we had the specific version field for ATS. I can" [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[17:48:28] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:48:29] <wikibugs>	 (PS1) Cwhite: add statsv throughput alerts [alerts] - https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469)
[17:49:26] <wikibugs>	 (PS3) Gergő Tisza: varnish: Fix X-Wikimedia-Debug cookie handling [puppet] - https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094)
[17:50:20] <wikibugs>	 (CR) BCornwall: "Honestly, I don't recall at all, and I don't see it being useful for the purposes of these cookbooks since their goal is to roll out new v" [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[17:50:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:50:38] <wikibugs>	 (CR) Gergő Tisza: "Done" [puppet] - https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza)
[17:51:22] <sukhe>	 !log sudo cumin 'A:cp-text' 'disable-puppet "rolling out CR 1129349"': T350094
[17:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:26] <stashbot>	 T350094: Enable verbose logging without installing the WikimediaDebug extension - https://phabricator.wikimedia.org/T350094
[17:52:56] <wikibugs>	 (CR) Ssingh: [C:+2] varnish: Fix X-Wikimedia-Debug cookie handling [puppet] - https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza)
[17:53:43] <logmsgbot>	 !log reedy@deploy2002 Synchronized php-1.44.0-wmf.21/includes/parser/Sanitizer.php: T388733 (duration: 11m 36s)
[17:53:47] <stashbot>	 T388733: PHP Warning: MediaWiki\Parser\Sanitizer::normalizeWhitespace: Failed to normalize whitespace: 6 [Called from MediaWiki\Parser\Sanitizer::normalizeWhitespace in /srv/mediawiki/php-1.44.0-wmf.20/includes/parser/Sanitizer.php - https://phabricator.wikimedia.org/T388733
[17:54:15] <wikibugs>	 (PS4) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882
[17:54:21] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[17:55:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:23] <wikibugs>	 (CR) Ssingh: "@rcoccioli@wikimedia.org: any thoughts on the unification part? Brett's current approach would make it cleaner to have one cookbook but he" [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[17:57:35] <sukhe>	 !log enable puppet and run agent on cp3071 to test CR 1129349
[17:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:05] <jouncebot>	 jnuche and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800). nyaa~
[18:00:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:02:18] <sukhe>	 !log sudo cumin -b11 'A:cp-text' 'enable-puppet-agent "rolling out CR 1129349"': T350094
[18:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:22] <stashbot>	 T350094: Enable verbose logging without installing the WikimediaDebug extension - https://phabricator.wikimedia.org/T350094
[18:02:48] <sukhe>	 !log sudo cumin -b11 'A:cp-text' 'run-puppet-agent "rolling out CR 1129349"': T350094
[18:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:28] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:08:04] <wikibugs>	 (CR) BCornwall: "Specifically, the complexity of a single cookbook would increase quite a bit, starting with having to pair the service with associated pac" [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[18:08:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658621 (phaultfinder)
[18:08:56] <wikibugs>	 (PS1) Gergő Tisza: Enable SUL3 logins for 50% of group 1 users [mediawiki-config] - https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153)
[18:09:00] <wikibugs>	 (CR) BCornwall: "For posterity, the join() was creating one long comma-separated string that was then passed to apt, e.g. `apt-get install foo,bar,baz`)" [cookbooks] - https://gerrit.wikimedia.org/r/1129891 (owner: BCornwall)
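A minimal sketch of the kind of bug BCornwall describes above, assuming the cookbook builds an apt command line from a list of package names. The function names and the package list here are hypothetical illustrations and are not taken from the actual change in https://gerrit.wikimedia.org/r/1129891.

```python
def build_install_command_buggy(packages):
    # Hypothetical buggy version: ",".join() collapses the list into one
    # argument, so apt would be asked for a single package "foo,bar,baz".
    return ["apt-get", "install", ",".join(packages)]


def build_install_command_fixed(packages):
    # Hypothetical fixed version: each package is passed as its own argument.
    return ["apt-get", "install", *packages]


if __name__ == "__main__":
    pkgs = ["foo", "bar", "baz"]
    print(" ".join(build_install_command_buggy(pkgs)))
    print(" ".join(build_install_command_fixed(pkgs)))
```

Keeping the command as a list of separate arguments, rather than a pre-joined string, is one way to avoid this class of mistake.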
[18:09:04] <wikibugs>	 (CR) BCornwall: [C:+2] sre.cdn.roll-upgrade-varnish: Fix package parsing [cookbooks] - https://gerrit.wikimedia.org/r/1129891 (owner: BCornwall)
[18:09:53] <wikibugs>	 (CR) Ssingh: "Yeah I am fine with merging this, unless volans can suggest a clean way of handling it. (He usually has surprises)." [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[18:10:25] <jinxer-wm>	 RESOLVED: [5x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:10:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) (owner: Gergő Tisza)
[18:10:48] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3069.esams.wmnet} and A:cp
[18:12:01] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3069 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129856 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[18:12:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge
[18:12:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge
[18:14:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] hieradata: migrate mw-misc to PHP 8.1 (1 of 2) [puppet] - https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:14:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:14:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529 (Cpetrillo) NEW
[18:15:55] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3069.esams.wmnet} and A:cp
[18:20:55] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:22:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658717 (Milimetric) approved as I am authorized to do per [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/d...
[18:22:45] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10658718 (BCornwall) Were they able to get back to you, @RobH ?
[18:24:12] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10658719 (BCornwall) Hi, @RobH, has this been able to be looked at? It's been depooled for a while now.  Thanks!
[18:24:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:25:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658743 (phaultfinder)
[18:25:55] <jinxer-wm>	 RESOLVED: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:28:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658759 (ssingh)
[18:31:12] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10658770 (RobH) Working on it now, pulling reports from idrac for case.
[18:31:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:32:10] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:36:35] <swfrench-wmf>	 jouncebot: nowandnext
[18:36:35] <jouncebot>	 For the next 1 hour(s) and 23 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800)
[18:36:36] <jouncebot>	 In 1 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000)
[18:36:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:23] <swfrench-wmf>	 jeena: I see that the train rolled to group2 in the earlier window. would you have any objections if I were to deploy some changes during this window? (one last PHP 8.1 switch)
[18:38:44] <jeena>	 swfrench-wmf: yes that would be fine afaik
[18:38:58] <swfrench-wmf>	 jeena: great, thank you!
[18:39:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:40:04] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:40:08] <wikibugs>	 (CR) Scott French: [C:+2] hieradata: migrate mw-misc to PHP 8.1 (1 of 2) [puppet] - https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:41:21] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10658820 (RobH) Yes, and they saw no temp errors in their investigation of the logs.  I'll flag this and dump their updates to this task later this week.
[18:42:04] <wikibugs>	 (CR) Scott French: [C:+2] mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:43:32] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [alerts] - https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) (owner: Cwhite)
[18:43:46] <wikibugs>	 (Merged) jenkins-bot: mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:45:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658825 (ssingh)
[18:46:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658838 (ssingh) @lanebecker: this requires your approval, thanks.  (Thanks @Milimetric)
[18:46:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:48:36] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Switch mw-misc to PHP 8.1 - T383845
[18:48:40] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:49:28] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:51:05] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Switch mw-misc to PHP 8.1 - T383845 (duration: 03m 22s)
[18:51:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host to test reimage - bking@cumin2002 - T380752
[18:51:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host to test reimage - bking@cumin2002 - T380752
[18:51:19] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[18:51:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658849 (lanebecker) Dropping in from holiday mode to approve. Approved!
[18:51:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:38] <wikibugs>	 (CR) Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[18:53:32] <wikibugs>	 (PS1) Ssingh: admin: add cpetrillo to analytics-privatedata-users (with krb) [puppet] - https://gerrit.wikimedia.org/r/1129913 (https://phabricator.wikimedia.org/T389529)
[18:54:26] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658881 (ssingh)
[18:54:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658884 (phaultfinder)
[18:55:41] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10658893 (RobH)
[18:56:55] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge
[18:57:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge
[18:59:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658895 (phaultfinder)
[19:00:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:01:19] <wikibugs>	 (CR) RLazarus: [C:+1] admin: add cpetrillo to analytics-privatedata-users (with krb) [puppet] - https://gerrit.wikimedia.org/r/1129913 (https://phabricator.wikimedia.org/T389529) (owner: Ssingh)
[19:01:52] <wikibugs>	 (PS1) Jforrester: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515)
[19:02:29] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3070 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129857 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[19:02:46] <swfrench-wmf>	 FYI, barring any surprises with mw-misc, this concludes my changes
[19:03:28] <wikibugs>	 (CR) Ssingh: [C:+2] admin: add cpetrillo to analytics-privatedata-users (with krb) [puppet] - https://gerrit.wikimedia.org/r/1129913 (https://phabricator.wikimedia.org/T389529) (owner: Ssingh)
[19:03:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3070.esams.wmnet} and A:cp
[19:04:39] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1043-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:07:21] <jinxer-wm>	 RESOLVED: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[19:07:21] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.143.0" for 193 host(s)
[19:07:27] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:08:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:09:39] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:10:18] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3070.esams.wmnet} and A:cp
[19:11:50] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.143.0" completed for 193 hosts
[19:11:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658944 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@krb1001:~$ sudo manage_principals.py create cpetrillo --email_addres...
[19:13:21] <logmsgbot>	 !log dancy@deploy2002 Started scap sync-world: T388761
[19:13:22] <wikibugs>	 (03PS1) 10Bking: relforge: add relforge1004 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752)
[19:13:25] <stashbot>	 T388761: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761
[19:13:41] <wikibugs>	 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389537 (10phaultfinder) 03NEW
[19:14:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658968 (10phaultfinder)
[19:15:08] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking)
[19:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:47] <denisse>	 !log restarting prometheus@ops.service in prometheus1005
[19:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:19] <sukhe>	 thanks denisse!
[19:17:21] <wikibugs>	 (03CR) 10Bking: [C:03+2] relforge: add relforge1004 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking)
[19:17:33] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging, as this does not touch production" [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking)
[19:18:20] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[19:18:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:21:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for ban host to test reimage - bking@cumin2002 - T380752
[19:21:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for ban host to test reimage - bking@cumin2002 - T380752
[19:21:51] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[19:22:27] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:24:37] <logmsgbot>	 !log dancy@deploy2002 Finished scap sync-world: T388761 (duration: 11m 15s)
[19:24:41] <stashbot>	 T388761: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761
[19:25:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:26:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge
[19:26:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge
[19:27:33] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm, if it's in an order like on https://phabricator.wikimedia.org/T326368 and the "profile::gerrit::active_host" is also changed in Hier" [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[19:27:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host to test puppet code - bking@cumin2002 - T380752
[19:27:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host to test puppet code - bking@cumin2002 - T380752
[19:27:37] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[19:28:40] <wikibugs>	 (03PS2) 10Gergő Tisza: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433)
[19:29:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza)
[19:31:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:32:18] <wikibugs>	 (03CR) 10Dzahn: gerrit: switchover to gerrit2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[19:33:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge
[19:33:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge
[19:33:43] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "generally looks fine to me, it's just about the order of things. so.. first disable gerrit on source.. then sync lfs data one last time.. " [puppet] - 10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[19:35:31] <wikibugs>	 (03PS3) 10Gergő Tisza: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433)
[19:38:20] <kamila_>	 jouncebot: nowandnext
[19:38:20] <jouncebot>	 For the next 0 hour(s) and 21 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800)
[19:38:20] <jouncebot>	 In 0 hour(s) and 21 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000)
[19:42:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:42:36] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] upgrade cp3071 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129858 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:43:22] <wikibugs>	 10SRE-swift-storage: Swift file replicated to codfw but not eqiad - https://phabricator.wikimedia.org/T389539 (10Dylsss) 03NEW
[19:44:05] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3071.esams.wmnet} and A:cp
[19:46:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:46:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[19:47:57] <wikibugs>	 (03PS1) 10Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919
[19:48:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (owner: 10Dzahn)
[19:48:23] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza)
[19:50:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1116
[19:50:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:50:37] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3071.esams.wmnet} and A:cp
[19:50:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659058 (10phaultfinder)
[19:50:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002"
[19:51:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002"
[19:51:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:51:09] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1117
[19:51:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:51:20] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host elastic1117
[19:51:23] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1116
[19:51:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1117
[19:52:47] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1117
[19:52:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1118
[19:54:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1118
[19:54:09] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1119
[19:55:14] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1119
[19:55:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1120
[19:55:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:56:32] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1120
[19:56:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1121
[19:57:24] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza)
[19:58:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1121
[19:58:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1122
[19:59:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:59:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1122
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000)
[20:00:05] <jouncebot>	 inflatador, cwhite, Superpes, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:16] <inflatador>	 .o/
[20:00:24] <cwhite>	 o/
[20:01:07] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[20:02:05] <tgr_>	 o/
[20:02:40] <wikibugs>	 (03PS1) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833)
[20:02:47] <wikibugs>	 (03PS2) 10Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833)
[20:03:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn)
[20:03:35] <wikibugs>	 (03CR) 10Dzahn: "hrmm,, something is wrong with the syntax but "$first_element = lookup('my_array')[0]" is supposed to be it. Anyways.. for now just presen" [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn)
[20:03:40] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn)
[20:04:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:04:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10659093 (10Jclark-ctr)
[20:05:21] <wikibugs>	 (03CR) 10Dzahn: "I would like to avoid that we always have to replace hardcoded host names in multiple places every time we switch.. this is just a first i" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn)
[20:07:53] <tgr_>	 I can deploy
[20:08:28] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "query tested (71 rows in 0.128 seconds)" [puppet] - 10https://gerrit.wikimedia.org/r/1129806 (https://phabricator.wikimedia.org/T380300) (owner: 10Aklapper)
[20:09:55] <wikibugs>	 (03PS1) 10Krinkle: docroot: Enable Chrome credential sharing on foundation.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520)
[20:10:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse)
[20:11:55] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] docroot: Enable Chrome credential sharing on foundation.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle)
[20:13:06] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: explicitly route search traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse)
[20:13:25] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129181|cirrus: explicitly route search traffic to eqiad (T388610)]]
[20:13:29] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[20:18:22] <logmsgbot>	 !log tgr@deploy2002 dcausse, tgr: Backport for [[gerrit:1129181|cirrus: explicitly route search traffic to eqiad (T388610)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:20:16] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] upgrade cp3072 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129859 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[20:21:29] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3072.esams.wmnet} and A:cp
[20:21:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659176 (10phaultfinder)
[20:25:07] <tgr_>	 inflatador: do you want to test it?
[20:26:15] <inflatador>	 tgr_ 1 sec
[20:26:39] <inflatador>	 tgr_ no, we're good
[20:27:02] <logmsgbot>	 !log tgr@deploy2002 dcausse, tgr: Continuing with sync
[20:27:26] <wikibugs>	 (03PS2) 10Jforrester: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515)
[20:27:35] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3072.esams.wmnet} and A:cp
[20:27:45] <wikibugs>	 (03CR) 10Jforrester: "Re-cherry-picked now that the patch has landed in master, so we get the nice blame git hash." [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester)
[20:28:45] <wikibugs>	 (03CR) 10Dzahn: "I see T381417 is now resolved. How about the status of this now?" [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[20:29:55] <wikibugs>	 (03CR) 10Dzahn: create a namespace for codesearch on k8s-aux cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[20:31:13] <wikibugs>	 (03PS1) 10Cwhite: es_exporter: constrain wikifunctions query [puppet] - 10https://gerrit.wikimedia.org/r/1129927 (https://phabricator.wikimedia.org/T388174)
[20:32:40] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "done and done. I should be able to deploy the new namespace at any time. Docs per Alex: https://wikitech.wikimedia.org/wiki/Kubernetes/Add" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[20:32:45] <wikibugs>	 (03PS2) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317)
[20:34:32] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129181|cirrus: explicitly route search traffic to eqiad (T388610)]] (duration: 21m 07s)
[20:34:36] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[20:35:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite)
[20:37:53] <wikibugs>	 (03PS3) 10Reedy: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester)
[20:39:03] <wikibugs>	 (03PS20) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227)
[20:39:53] <wikibugs>	 (03CR) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester)
[20:41:07] <wikibugs>	 (03Merged) 10jenkins-bot: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite)
[20:41:25] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1081461|Profiler: emit both statsd and dogstatsd (T359385)]]
[20:41:28] <stashbot>	 T359385: Migrate MediaWiki.arclamp to statslib - https://phabricator.wikimedia.org/T359385
[20:41:53] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[20:45:35] <Reedy>	 jouncebot: nowandnext
[20:45:35] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000)
[20:45:35] <jouncebot>	 In 0 hour(s) and 14 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100)
[20:45:53] <logmsgbot>	 !log tgr@deploy2002 cwhite, tgr: Backport for [[gerrit:1081461|Profiler: emit both statsd and dogstatsd (T359385)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:46:41] <kamila_>	 Reedy: are you planning to do deployments? I'd like to do the deployment server switchover after this deploy window, but I can wait if you need me to 
[20:47:11] <Reedy>	 kamila_: I wouldn't mind, it gets rid of quite a lot of logspam
[20:47:49] <kamila_>	 Reedy: ok, go ahead and lmk when you're done please :-)
[20:48:09] <Krinkle>	 once tgr_ is done with the current deployment, I'd like to claim mwdebug1002 to debug an issue.
[20:48:14] <wikibugs>	 (03CR) 10Reedy: [C:03+2] AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester)
[20:48:31] <tgr_>	 kamila_: if it's not urgent, I'd like to deploy the fix for T389433, and that will probably take a while (code is pretty much untestable outside production)
[20:48:32] <stashbot>	 T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433
[20:49:31] <tgr_>	 cwhite: do you need to test?
[20:49:32] <wikibugs>	 (03PS1) 10Andrew Bogott: Update horizon version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1129935 (https://phabricator.wikimedia.org/T380531)
[20:49:57] <cwhite>	 tgr_: LGTM so far, no errors AFAICT
[20:50:04] <wikibugs>	 (03CR) 10Ecarg: [C:03+1] es_exporter: constrain wikifunctions query [puppet] - 10https://gerrit.wikimedia.org/r/1129927 (https://phabricator.wikimedia.org/T388174) (owner: 10Cwhite)
[20:50:06] <logmsgbot>	 !log tgr@deploy2002 cwhite, tgr: Continuing with sync
[20:50:36] <kamila_>	 tgr_: what is "a while"? it's not urgent, so you can go ahead, but I'd like to know roughly when I'll be able to start
[20:51:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Update horizon version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1129935 (https://phabricator.wikimedia.org/T380531) (owner: 10Andrew Bogott)
[20:51:14] <wikibugs>	 (03Merged) 10jenkins-bot: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester)
[20:51:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.382s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:51:38] <tgr_>	 I have two config patches that don't need testing, if those can be batched with Reedy's patch, that's ~15 min
[20:51:49] <tgr_>	 then the wikitech patch is another half an hour maybe?
[20:52:00] <kamila_>	 ok, cool, thanks tgr_ 
[20:53:22] <tgr_>	 Reedy: does that sound ok?
[20:53:36] <Reedy>	 wfm. Mine doesn't need testing
[20:53:44] <Reedy>	 as it's causing cli logspam (dumps)
[20:54:57] <wikibugs>	 (03PS1) 10Cwhite: es_exporter: add metric gathering for wikifunctions backend services [puppet] - 10https://gerrit.wikimedia.org/r/1129936 (https://phabricator.wikimedia.org/T388174)
[20:56:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.39s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:56:28] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10659322 (10Umherirrender)
[20:57:36] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081461|Profiler: emit both statsd and dogstatsd (T359385)]] (duration: 16m 11s)
[20:57:40] <stashbot>	 T359385: Migrate MediaWiki.arclamp to statslib - https://phabricator.wikimedia.org/T359385
[20:58:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15)
[20:58:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[20:59:21] <wikibugs>	 (03Merged) 10jenkins-bot: Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15)
[20:59:23] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SUL3 logins for 50% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[20:59:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659334 (10phaultfinder)
[20:59:44] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129435|Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (T389400)]], [[gerrit:1129905|Enable SUL3 logins for 50% of group 1 users (T384153)]], [[gerrit:1129914|AbstractIterator: Make PHP 8.1 compatible (T389515)]]
[20:59:51] <stashbot>	 T389400: Lift IP for a edit-a-thon in Ciudad de Buenos Aires, Argentina 2025-03-29  - https://phabricator.wikimedia.org/T389400
[20:59:51] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[20:59:51] <stashbot>	 T389515: PHP Deprecated: Return type of Flow\Search\Iterators\AbstractIterator::current() should either be compatible with Iterator::current(): mixed, or the #[\ReturnTypeWillChange] attribute should be used - https://phabricator.wikimedia.org/T389515
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100)
[21:00:31] <Reedy>	 thanks tgr_ :)
[21:01:24] <wikibugs>	 (03CR) 10Herron: [C:03+1] "Thanks for the ping, LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[21:04:35] <logmsgbot>	 !log tgr@deploy2002 tgr, jforrester, superpes: Backport for [[gerrit:1129435|Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (T389400)]], [[gerrit:1129905|Enable SUL3 logins for 50% of group 1 users (T384153)]], [[gerrit:1129914|AbstractIterator: Make PHP 8.1 compatible (T389515)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:06:10] <wikibugs>	 (03CR) 10Ecarg: [C:03+1] es_exporter: add metric gathering for wikifunctions backend services [puppet] - 10https://gerrit.wikimedia.org/r/1129936 (https://phabricator.wikimedia.org/T388174) (owner: 10Cwhite)
[21:06:36] <logmsgbot>	 !log tgr@deploy2002 tgr, jforrester, superpes: Continuing with sync
[21:13:19] <wikibugs>	 (PS1) Ahmon Dancy: cloud.yaml: Supply a reasonable default for profile::tlsproxy::envoy::global_cert_name [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946)
[21:13:41] <wikibugs>	 (CR) CI reject: [V:-1] cloud.yaml: Supply a reasonable default for profile::tlsproxy::envoy::global_cert_name [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[21:14:11] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129435|Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (T389400)]], [[gerrit:1129905|Enable SUL3 logins for 50% of group 1 users (T384153)]], [[gerrit:1129914|AbstractIterator: Make PHP 8.1 compatible (T389515)]] (duration: 14m 26s)
[21:14:17] <stashbot>	 T389400: Lift IP for a edit-a-thon in Ciudad de Buenos Aires, Argentina 2025-03-29  - https://phabricator.wikimedia.org/T389400
[21:14:18] <stashbot>	 T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153
[21:14:18] <stashbot>	 T389515: PHP Deprecated: Return type of Flow\Search\Iterators\AbstractIterator::current() should either be compatible with Iterator::current(): mixed, or the #[\ReturnTypeWillChange] attribute should be used - https://phabricator.wikimedia.org/T389515
[21:14:27] <wikibugs>	 SRE-swift-storage, Commons, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T389538#10659391 (Pppery)
[21:14:42] <wikibugs>	 (CR) Cwhite: [C:+2] es_exporter: constrain wikifunctions query [puppet] - https://gerrit.wikimedia.org/r/1129927 (https://phabricator.wikimedia.org/T388174) (owner: Cwhite)
[21:14:45] <wikibugs>	 (PS2) Ahmon Dancy: cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946)
[21:14:53] <wikibugs>	 (CR) Cwhite: [C:+2] es_exporter: add metric gathering for wikifunctions backend services [puppet] - https://gerrit.wikimedia.org/r/1129936 (https://phabricator.wikimedia.org/T388174) (owner: Cwhite)
[21:14:54] <Superpes>	 Thanks tgr_ 
[21:14:57] <Superpes>	 :)
[21:14:57] <wikibugs>	 (PS1) Ahmon Dancy: profile::tlsproxy::envoy: Tweak an error message [puppet] - https://gerrit.wikimedia.org/r/1129940
[21:15:15] <tgr_>	 deep breath
[21:15:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: Gergő Tisza)
[21:16:10] <wikibugs>	 (Merged) jenkins-bot: Clear stuck session cookies on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: Gergő Tisza)
[21:16:27] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129845|Clear stuck session cookies on Wikitech (T389433)]]
[21:16:31] <stashbot>	 T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433
[21:19:09] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1129845|Clear stuck session cookies on Wikitech (T389433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:29:33] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[21:31:40] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3073 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129860 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[21:31:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3073.esams.wmnet} and A:cp
[21:32:27] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1043-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:32:50] <wikibugs>	 (CR) Ebernhardson: [C:+1] cirrus: update alerts based on rc0 topics [alerts] - https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[21:33:39] <wikibugs>	 (CR) Ebernhardson: [C:+1] cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[21:33:44] <urandom>	 !log bootstrapping restbase1034-b/cassandra — T389423
[21:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:48] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[21:34:19] <wikibugs>	 (PS1) Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945)
[21:34:21] <urandom>	 !log bootstrapping restbase1043-b/cassandra — T389423 (previous msg(s) typo-ed)
[21:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:02] <wikibugs>	 (PS2) Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945)
[21:36:58] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129845|Clear stuck session cookies on Wikitech (T389433)]] (duration: 20m 31s)
[21:37:02] <stashbot>	 T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433
[21:37:27] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1043-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:38:01] <wikibugs>	 (CR) Dzahn: "thanks for the fix!  Just the part that git blame tells me the line is like this since 2020 confuses me right now. Because the puppet erro" [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[21:38:03] <wikibugs>	 (PS1) Andrew Bogott: Update horizon version in codfw1dev, again [puppet] - https://gerrit.wikimedia.org/r/1129944 (https://phabricator.wikimedia.org/T380531)
[21:38:12] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3073.esams.wmnet} and A:cp
[21:38:29] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Update horizon version in codfw1dev, again [puppet] - https://gerrit.wikimedia.org/r/1129944 (https://phabricator.wikimedia.org/T380531) (owner: Andrew Bogott)
[21:39:01] <tgr_>	 kamila_: all done, sorry for the wait!
[21:39:10] <tgr_>	 !log late UTC deploys done
[21:39:10] <kamila_>	 np, thanks tgr_ !
[21:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:21] <wikibugs>	 (CR) Ahmon Dancy: "Yeah, it's the "include profile::tlsproxy::envoy" at https://gerrit.wikimedia.org/g/operations/puppet/+/refs/changes/31/1094531/25/modules" [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[21:39:32] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[21:40:43] <wikibugs>	 (PS1) Jasmine: wmnet: update deployment CNAME record to deploy1003 [dns] - https://gerrit.wikimedia.org/r/1129945 (https://phabricator.wikimedia.org/T385155)
[21:41:28] <wikibugs>	 (CR) Dzahn: "puppet breakage on non-prod-deployment servers -> https://phabricator.wikimedia.org/T383946#10658168 - thanks for the fix at https://gerri" [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[21:41:38] <wikibugs>	 (CR) Dzahn: [C:+2] "Gotcha! thanks!" [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[21:42:57] <wikibugs>	 (CR) Dzahn: [C:+2] "I really hope it's not going to affect other cloud VPS machines using envoy that aren't deployment servers.. this global cloud.yaml is bro" [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[21:43:18] <kamila_>	 jouncebot: nowandnext
[21:43:18] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100)
[21:43:19] <jouncebot>	 In 8 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0600)
[21:44:26] <kamila_>	 Note that jasmine_ and I are switching the deployment server as part of the datacenter switchover process. The current deployment server will be deploy1003.eqiad.wmnet . We do not expect this to cause any issues, but ping me or jasmine_ if you think you found one! Thanks :-)
[21:47:59] <wikibugs>	 (CR) Dzahn: "You may already be aware, but please keep in mind there are a couple other places in Hiera where "the deployment server" is defined:" [dns] - https://gerrit.wikimedia.org/r/1129945 (https://phabricator.wikimedia.org/T385155) (owner: Jasmine)
[21:49:45] <logmsgbot>	 !log kamila@deploy2002 Locking from deployment [MediaWiki]: deployment server switch -- T385155
[21:49:49] <stashbot>	 T385155:  🧭 Northward Datacentre Switchover (March 2025)  - https://phabricator.wikimedia.org/T385155
[21:52:16] <wikibugs>	 (CR) Ahmon Dancy: [C:-1] "Holding.  Not working as expected in beta yet." [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[21:54:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659704 (phaultfinder)
[21:54:59] <wikibugs>	 (CR) Kamila Součková: [C:+2] wmnet: point deploy server at eqiad [dns] - https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[21:56:26] <wikibugs>	 (Abandoned) Jasmine: wmnet: update deployment CNAME record to deploy1003 [dns] - https://gerrit.wikimedia.org/r/1129945 (https://phabricator.wikimedia.org/T385155) (owner: Jasmine)
[21:56:44] <wikibugs>	 (CR) Dzahn: "probably safer to pass the parameter through to the systemd::service defines so that services are stopped if you ever go backwards from pr" [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[21:56:44] <wikibugs>	 SRE-swift-storage, MediaWiki-libs-HTTP, MW-Interfaces-Team, Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10659728 (Tgr) >>! In T389543#10659214, @Tgr wrote: >...
[21:58:34] <mutante>	 jouncebot: nowandnext
[21:58:34] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100)
[21:58:34] <jouncebot>	 In 8 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0600)
[21:59:20] <wikibugs>	 (PS2) Kamila Součková: wmnet: point deploy server at eqiad [dns] - https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[21:59:32] <wikibugs>	 (CR) Dzahn: [C:+2] "this fixed the error but there is a new one after that:" [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[21:59:42] <wikibugs>	 (CR) Ahmon Dancy: [C:-1] "That happens on lines 38 and 43 of modules/profile/manifests/scap/spiderpig.pp.  Or do you mean something else?" [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[22:00:00] <wikibugs>	 (CR) Kamila Součková: [C:+2] wmnet: point deploy server at eqiad [dns] - https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[22:00:26] <wikibugs>	 (CR) Dzahn: [C:+2] "Hmm.. this seems like it would affect all cloud VPS machines using envoy now.. tempted to revert" [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[22:00:58] <wikibugs>	 (CR) Ahmon Dancy: "Go ahead and revert.  Let's see if we can figure out something better tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: Ahmon Dancy)
[22:00:58] <logmsgbot>	 !log kamila@dns1004 START - running authdns-update
[22:01:22] <mutante>	 kamila_: are you planning to change common.yaml and common/scap.yaml after the DNS change, not before?
[22:01:31] <dancy>	 mutante:  I've lost my steam for the day. Can we regroup tomorrow?
[22:01:47] <mutante>	 dancy: sounds good, yes
[22:02:05] <mutante>	 I guess revert is slightly better than not revert
[22:02:15] <dancy>	 Agreed.
[22:02:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:02:40] <wikibugs>	 (PS1) Dzahn: Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - https://gerrit.wikimedia.org/r/1129949
[22:02:41] <kamila_>	 mutante: after, I am sitting on a scap lock
[22:02:55] <mutante>	 kamila_: gotcha!:)
[22:03:02] <wikibugs>	 (CR) CI reject: [V:-1] Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - https://gerrit.wikimedia.org/r/1129949 (owner: Dzahn)
[22:03:30] <mutante>	 gotta love -1 on reverts 
[22:03:50] <kamila_>	 noms
[22:03:52] <mutante>	 ah, long lines.. 
[22:04:03] <wikibugs>	 (PS2) Dzahn: Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - https://gerrit.wikimedia.org/r/1129949
[22:05:07] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - https://gerrit.wikimedia.org/r/1129949 (owner: Dzahn)
[22:07:02] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:07:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:08:12] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659784 (RobH) Support won't push the case further until we update all firmware, doing so now.
[22:08:49] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:09:04] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:09:24] <logmsgbot>	 !log dzahn@dns1004 START - running authdns-update
[22:09:29] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:11:24] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3075 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129861 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[22:12:46] <logmsgbot>	 !log kamila@dns1004 START - running authdns-update
[22:13:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3075.esams.wmnet} and A:cp
[22:18:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3075.esams.wmnet} and A:cp
[22:22:12] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4046.ulsfo.wmnet
[22:23:10] <wikibugs>	 (PS1) Kamila Součková: hieradata: update deployment_server to deploy1003 [puppet] - https://gerrit.wikimedia.org/r/1129952 (https://phabricator.wikimedia.org/T385155)
[22:26:16] <wikibugs>	 (CR) Jasmine: [C:+1] hieradata: update deployment_server to deploy1003 [puppet] - https://gerrit.wikimedia.org/r/1129952 (https://phabricator.wikimedia.org/T385155) (owner: Kamila Součková)
[22:26:25] <wikibugs>	 SRE, Traffic, Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825#10659837 (Aklapper) @BBlack, @Vgutierrez: Could you please answer the last comment? Thanks in advance!
[22:38:22] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns1005.wikimedia.org
[22:38:55] <sukhe>	 !log depool dns1005 to debug zone files not in sync with dns.git
[22:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:04] <logmsgbot>	 !log kamila@dns1004 START - running authdns-update
[22:39:08] <wikibugs>	 (CR) Pppery: Graph: Use new placeholder i18n from WikimediaMessages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[22:41:10] <logmsgbot>	 !log kamila@dns1004 END - running authdns-update
[22:41:11] <wikibugs>	 (CR) Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[22:41:19] <kamila_>	 !log switch deployment.w.o DNS to eqiad
[22:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:26] <wikibugs>	 (CR) Kamila Součková: [C:+2] hieradata: update deployment_server to deploy1003 [puppet] - https://gerrit.wikimedia.org/r/1129952 (https://phabricator.wikimedia.org/T385155) (owner: Kamila Součková)
[22:44:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659911 (phaultfinder)
[22:47:11] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:48:11] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:49:19] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[22:51:42] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3076 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129862 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[22:51:58] <wikibugs>	 (PS2) Tim Starling: block: Don't modify an autoblock when the user specifies an IP [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129957 (https://phabricator.wikimedia.org/T389452)
[22:56:09] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3076.esams.wmnet} and A:cp
[22:56:42] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659939 (RobH) Support is requiring firmware updates, as this is pretty far out of date.  Current iDrac firmware:  5.10.30.00  Current BIOS firmware:  1.6.5   Support stated we should go from 5.10.30...
[22:58:16] <logmsgbot>	 !log kamila@deploy2002 Unlocked for deployment [MediaWiki]: deployment server switch -- T385155 (duration: 68m 30s)
[22:58:20] <stashbot>	 T385155:  🧭 Northward Datacentre Switchover (March 2025)  - https://phabricator.wikimedia.org/T385155
[23:01:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3076.esams.wmnet} and A:cp
[23:02:19] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659951 (BCornwall) Thanks for doing this. If you want any assistance on doing the updates, let me know - I'd do it right now but it looks like you might be in the middle of upgrades and I don't wanna...
[23:04:14] <wikibugs>	 (PS1) Kimberly Sarabia: Stream registration for article summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097)
[23:05:49] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:06:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:06:45] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659982 (RobH) Yeah, it took the same command 3 times for it to finally not time out or break in some way, but it finally updated to cp4047 (IDRAC): now at version: 5.10.50.0 .  Now to move it along u...
[23:10:20] <wikibugs>	 (CR) Tim Starling: [C:+2] block: Don't modify an autoblock when the user specifies an IP [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129957 (https://phabricator.wikimedia.org/T389452) (owner: Tim Starling)
[23:10:53] <logmsgbot>	 !log kamila@deploy1003 Started scap sync-world: Test deployment to validate deployment server switchover - T385155
[23:10:57] <stashbot>	 T385155:  🧭 Northward Datacentre Switchover (March 2025)  - https://phabricator.wikimedia.org/T385155
[23:13:47] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns1005.wikimedia.org
[23:14:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660004 (phaultfinder)
[23:15:05] <wikibugs>	 (Merged) jenkins-bot: block: Don't modify an autoblock when the user specifies an IP [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1129957 (https://phabricator.wikimedia.org/T389452) (owner: Tim Starling)
[23:15:43] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:18:08] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10660007 (Jhancock.wm) Open→Resolved
[23:18:10] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10660011 (RobH) > Support, >  > Can you confirm you see the failure and what part the failure occurred on with the logs sent over? >  > Updating the firmware now. >  > Please advise, >     > Hi Rob  >...
[23:18:20] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[23:18:26] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10660012 (Jhancock.wm) @MoritzMuehlenhoff ready!
[23:18:33] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:18:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:19:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660013 (phaultfinder)
[23:25:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660014 (phaultfinder)
[23:28:29] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:29:44] <robh>	 !log updating cp4047 bios via T387238, server will flap but is not pooled
[23:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:49] <stashbot>	 T387238: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238
[23:30:20] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:30:36] <logmsgbot>	 !log kamila@deploy1003 Finished scap sync-world: Test deployment to validate deployment server switchover - T385155 (duration: 19m 42s)
[23:30:39] <stashbot>	 T385155:  🧭 Northward Datacentre Switchover (March 2025)  - https://phabricator.wikimedia.org/T385155
[23:30:43] <wikibugs>	 (CR) Dzahn: "oh.. duh! yea, please ignore that previous comment :)" [puppet] - https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[23:30:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660022 (phaultfinder)
[23:30:49] <wikibugs>	 (CR) BCornwall: [C:+2] upgrade cp3077 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129863 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[23:31:21] <kamila_>	 TimStarling: you can deploy if you want
[23:31:36] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet
[23:32:02] <wikibugs>	 (CR) Pppery: Graph: Use new placeholder i18n from WikimediaMessages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[23:32:05] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3077.esams.wmnet} and A:cp
[23:32:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:33] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3077.esams.wmnet} and A:cp
[23:38:24] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1129957|block: Don't modify an autoblock when the user specifies an IP (T389452)]]
[23:42:13] <logmsgbot>	 !log brett@dns1005 START - running authdns-update
[23:43:43] <logmsgbot>	 !log brett@dns1005 END - running authdns-update
[23:45:47] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp4047.ulsfo.wmnet
[23:45:55] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:46:15] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:47:28] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet
[23:53:46] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1129957|block: Don't modify an autoblock when the user specifies an IP (T389452)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:53:57] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Continuing with sync
[23:54:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660040 (phaultfinder)
[23:58:10] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp4047.ulsfo.wmnet
[23:58:13] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:58:21] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet
[23:59:26] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet