[00:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [00:02:45] PROBLEM - Webrequests Varnishkafka log producer on cp4042 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:02:45] PROBLEM - statsv Varnishkafka log producer on cp4038 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:02:47] PROBLEM - Webrequests Varnishkafka log producer on cp4043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:02:47] PROBLEM - Webrequests Varnishkafka log producer on cp4037 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:02:48] PROBLEM - statsv Varnishkafka log producer on cp4039 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:09:25] FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:25] FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655153 (10phaultfinder) [00:33:32] 10ops-codfw, 06SRE, 06DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10655170 (10Papaul) [00:35:23] (03PS1) 10Gergő Tisza: Enable SUL3 logins for 1% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) [00:37:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [00:38:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1129551 [00:38:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1129551 (owner: 10TrainBranchBot) [00:39:46] (03CR) 10CI reject: [V:04-1] Enable SUL3 logins for 1% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [00:49:25] FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas 
[00:54:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1129551 (owner: 10TrainBranchBot) [01:08:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1129583 [01:08:43] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1129583 (owner: 10TrainBranchBot) [01:10:09] (03CR) 10Dzahn: [C:03+1] "Yea, logically it makes sense to me to create the user in mediawiki::system_users, the compiler output looks good and the number of affect" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [01:11:39] (03CR) 10Dzahn: "@muehlenhoff this would be after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129389/3" [puppet] - 10https://gerrit.wikimedia.org/r/1129408 (owner: 10Ahmon Dancy) [01:37:41] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1129583 (owner: 10TrainBranchBot) [02:54:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:18] (03PS1) 10Chuckonwumelu: Add new profile [labs/private] - 10https://gerrit.wikimedia.org/r/1129595 [03:32:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [04:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [04:33:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [04:48:30] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [04:49:25] FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:53:30] FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [04:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:24] (03CR) 10Arnaudb: [C:03+1] vrts: add parameters for exim_deny_senders from private repo [puppet] - 10https://gerrit.wikimedia.org/r/1129369 (https://phabricator.wikimedia.org/T389356) (owner: 10Dzahn) [05:41:40] (03CR) 10Arnaudb: [C:03+1] vrts: add profile::vrts::exim_deny_senders with fake value [labs/private] - 10https://gerrit.wikimedia.org/r/1129374 (https://phabricator.wikimedia.org/T389079) (owner: 10Dzahn) [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0600) [06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0600) [06:05:53] jouncebot: next [06:05:53] In 1 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0800) [06:06:58] (03PS1) 10Marostegui: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129609 (https://phabricator.wikimedia.org/T388627) [06:08:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129609 (https://phabricator.wikimedia.org/T388627) (owner: 10Marostegui) [06:08:30] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [06:08:46] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129609 (https://phabricator.wikimedia.org/T388627) (owner: 10Marostegui) [06:08:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10655783 (10Marostegui) Thank you! 
[06:09:46] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1129609|db-production.php: Disable writes on es7 (T388627)]] [06:09:50] T388627: Disable circular replication after DC switchover - https://phabricator.wikimedia.org/T388627 [06:13:14] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1129609|db-production.php: Disable writes on es7 (T388627)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:13:17] !log marostegui@deploy2002 marostegui: Continuing with sync [06:13:20] (03PS1) 10Marostegui: db1246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1129610 (https://phabricator.wikimedia.org/T387673) [06:13:44] (03CR) 10Marostegui: [C:03+2] db1246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1129610 (https://phabricator.wikimedia.org/T387673) (owner: 10Marostegui) [06:14:17] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10655795 (10Marostegui) I am automatically slowly pooling this host back [06:14:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74266 and previous config saved to /var/cache/conftool/dbconfig/20250320-061426-root.json [06:17:04] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129611 [06:17:10] (03CR) 10Marostegui: [C:04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129611 (owner: 10Marostegui) [06:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655807 (10phaultfinder) [06:20:53] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129609|db-production.php: Disable writes on es7 (T388627)]] (duration: 11m 07s) [06:20:57] T388627: Disable circular replication after DC switchover - 
https://phabricator.wikimedia.org/T388627 [06:21:28] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es7 [06:21:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es7 [06:22:45] (03CR) 10Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129611 (owner: 10Marostegui) [06:22:47] (03CR) 10Marostegui: [C:03+2] Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129611 (owner: 10Marostegui) [06:23:00] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es6 [06:23:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es6 [06:23:37] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129611 (owner: 10Marostegui) [06:23:38] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section x1 [06:23:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section x1 [06:24:27] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s8 [06:24:40] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1129611|Revert "db-production.php: Disable writes on es7"]] [06:24:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s8 [06:25:43] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section 
s7 [06:25:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s7 [06:26:16] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s6 [06:26:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s6 [06:26:53] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s5 [06:28:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s5 [06:28:57] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s4 [06:29:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s4 [06:29:30] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1129611|Revert "db-production.php: Disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:29:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P74267 and previous config saved to /var/cache/conftool/dbconfig/20250320-062931-root.json [06:29:55] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s3 [06:30:07] !log marostegui@deploy2002 marostegui: Continuing with sync [06:30:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s3 [06:30:28] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s2 [06:30:43] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s2 [06:31:20] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s1 [06:31:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s1 [06:34:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T389367 [06:35:00] T389367: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T389367 [06:35:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2161 with weight 0 T389367', diff saved to https://phabricator.wikimedia.org/P74268 and previous config saved to /var/cache/conftool/dbconfig/20250320-063509-marostegui.json [06:36:39] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1129288 (https://phabricator.wikimedia.org/T389367) (owner: 10Gerrit maintenance bot) [06:37:29] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129611|Revert "db-production.php: Disable writes on es7"]] (duration: 12m 48s) [06:39:44] !log Starting s8 codfw failover from db2165 to db2161 - T389367 [06:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2161 to s8 primary T389367', diff saved to https://phabricator.wikimedia.org/P74269 and previous config saved to /var/cache/conftool/dbconfig/20250320-064012-marostegui.json [06:40:16] T389367: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T389367 [06:41:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2165 T389367', diff saved to https://phabricator.wikimedia.org/P74270 and previous config saved to 
/var/cache/conftool/dbconfig/20250320-064131-marostegui.json [06:43:12] (03PS1) 10Marostegui: db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1129725 (https://phabricator.wikimedia.org/T387441) [06:43:30] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2165.codfw.wmnet [06:44:07] (03CR) 10Marostegui: [C:03+2] db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1129725 (https://phabricator.wikimedia.org/T387441) (owner: 10Marostegui) [06:44:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74271 and previous config saved to /var/cache/conftool/dbconfig/20250320-064437-root.json [06:50:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2165.codfw.wmnet [06:51:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on db2165.codfw.wmnet with reason: Maintenance [06:54:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:58:27] (03Abandoned) 10Ayounsi: Remove v6 include for e8/f8 uplinks [dns] - 10https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) (owner: 10Ayounsi) [06:59:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:59:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74272 and previous config saved to /var/cache/conftool/dbconfig/20250320-065942-root.json 
[07:02:24] (03CR) 10Btullis: [C:03+2] Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [07:04:25] FIRING: [8x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:13:32] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10655910 (10ayounsi) @RobH make sure to link the inbound shipment to the existing ticket, so remote hands can set it up directly. Let's also use the initial positions : port... [07:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655911 (10phaultfinder) [07:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74273 and previous config saved to /var/cache/conftool/dbconfig/20250320-071448-root.json [07:24:37] !log rebalance ganeti eqiad/C following reimages T382507 [07:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:41] T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 [07:25:29] (03PS2) 10Filippo Giunchedi: prometheus: add function to replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128779 (https://phabricator.wikimedia.org/T389170) [07:26:20] (03PS2) 10Filippo Giunchedi: alertmanager: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128780 (https://phabricator.wikimedia.org/T389170) [07:28:47] (03PS1) 10Filippo Giunchedi: kubernetes: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129178 
(https://phabricator.wikimedia.org/T389170) [07:28:48] (03PS1) 10Filippo Giunchedi: snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) [07:29:25] FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:29:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74274 and previous config saved to /var/cache/conftool/dbconfig/20250320-072953-root.json [07:31:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [07:32:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [07:32:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [07:32:37] (03PS1) 10Filippo Giunchedi: pontoon: refactor nova.py with cache [puppet] - 10https://gerrit.wikimedia.org/r/1129370 [07:32:50] (03PS1) 10Filippo Giunchedi: pontoon: refactor Filter to work with CloudHost [puppet] - 10https://gerrit.wikimedia.org/r/1129371 [07:33:18] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5005 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129262 (owner: 10Muehlenhoff) [07:34:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:31] !log btullis@deploy2002 Started deploy [dumps/dumps@2fe1059]: Fixing index out of range error [07:35:31] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [07:35:32] !log btullis@deploy2002 Finished deploy [dumps/dumps@2fe1059]: Fixing index out of range error (duration: 00m 09s) [07:35:44] !log btullis@deploy2002 Started deploy [dumps/dumps@2fe1059]: Fixing index out of range error [07:35:50] !log btullis@deploy2002 Finished deploy [dumps/dumps@2fe1059]: Fixing index out of range error (duration: 00m 09s) [07:35:52] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor nova.py with cache [puppet] - 10https://gerrit.wikimedia.org/r/1129370 (owner: 10Filippo Giunchedi) [07:35:59] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor Filter to work with CloudHost [puppet] - 10https://gerrit.wikimedia.org/r/1129371 (owner: 10Filippo Giunchedi) [07:41:15] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) [07:41:21] !log btullis@deploy2002 Started deploy [dumps/dumps@2fe1059]: Fixing index out of range error [07:41:28] !log btullis@deploy2002 Finished deploy [dumps/dumps@2fe1059]: Fixing index out of range error (duration: 00m 26s) [07:41:36] (03CR) 10CI reject: [V:04-1] role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:42:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) 
[07:42:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Echo] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [07:43:05] (03PS2) 10Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) [07:44:05] (03CR) 10Elukey: [C:03+2] maps: remove Kartotherian from bare metal nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [07:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74275 and previous config saved to /var/cache/conftool/dbconfig/20250320-074459-root.json [07:45:02] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5117/" [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:45:34] (03PS3) 10Elukey: role::ml_k8s::worker: move ml-serve2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) [07:46:59] !log remove kartotherian from maps* bare metal nodes [07:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:29] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [07:54:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:57:01] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10655946 (10fgiunchedi) >>! In T374711#10652054, @jhathaway wrote: >>>! In T374711#10650455, @fgiunchedi wrote: >> There's two parts to keyholder, `-proxy... [07:59:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:26] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [08:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0800). [08:00:05] tgr and MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:13] o/ [08:00:56] my config change is failing with "Script git diff stash@{0} stash@{1} --minimal --color --exit-code handling the diffConfig event returned with error code 1" [08:01:05] I can't make sense of that error [08:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [08:02:12] the 7.4 diffConfig produces the exact same message but passes [08:02:13] since when do we have 8.1 jobs in config? 
[08:02:27] that seems like a CI-error [08:02:47] we are mostly on 8.1 now so it would make sense [08:03:00] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.update-views [08:03:09] AIUI the job in itself has to "fail" (in the logs) to show the diff, but somehow that is overwritten in the final consideration [08:03:20] makes sense yes, but since when is it actually live? [08:03:27] * MichaelG_WMF looks at previous changes [08:04:00] mine do not have them: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1129336 [08:04:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:38] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [08:04:39] so, probably they were added last night, but a mistake was made and not discovered until now? 
[08:04:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655952 (10phaultfinder) [08:05:01] (03CR) 10Ilias Sarantopoulos: [C:03+1] role::ml_k8s::worker: move ml-serve2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:05:46] (03CR) 10Michael Große: "recheck" [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:05:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [08:06:14] (03PS1) 10Muehlenhoff: Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) [08:06:39] I guess that would have been https://gerrit.wikimedia.org/r/c/integration/config/+/1129364 ? [08:07:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [08:07:36] sounds plausible [08:08:25] (03CR) 10CI reject: [V:04-1] Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) (owner: 10Muehlenhoff) [08:08:53] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [08:09:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [08:09:01] (03CR) 10Slyngshede: [C:03+2] Release version 0.1.1 [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 (owner: 10Slyngshede) [08:09:14] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129263 (owner: 10Muehlenhoff) [08:09:25] RESOLVED: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:18] 08:46:18 Script git diff stash@{0} stash@{1} --minimal --color --exit-code handling the diffConfig event returned with error code 1 [08:10:21] 08:46:18 Build step 'Execute shell' marked build as failure [08:10:23] (03CR) 10Vgutierrez: [C:03+2] varnish: X-Requestctl is now being handled by HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez) [08:10:29] (sorry wrong buffer) [08:10:33] https://gerrit.wikimedia.org/r/c/integration/config/+/1129765 [08:10:36] I think [08:10:48] (03Merged) 10jenkins-bot: Release version 0.1.1 [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 (owner: 10Slyngshede) [08:12:17] other config patches seem to be passing, though? [08:12:32] anyway I can deploy the backport in the meantime [08:13:35] your change looks good, not sure why CI is failing for it [08:13:48] @tgr_ yes, deploying my backports would be great :) [08:14:06] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [08:15:20] my backports are not properly testable. We are adding logging so we can figure out in what circumstances a warning occurs, which means we cannot actively trigger it to test it. 
The warning may or may not show up in the Echo channel [08:15:26] (03CR) 10CI reject: [V:04-1] CA: add timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [08:16:16] (03PS1) 10Arnaudb: gerrit: switchover to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833) [08:16:19] (03PS3) 10Arnaudb: gerrit: switchover to gerrit2002 [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833) [08:16:36] (03PS4) 10Arnaudb: gerrit: switchover to gerrit2002 [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833) [08:16:45] (03PS1) 10Muehlenhoff: Switch ganeti5007 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129766 [08:16:48] re other config changes: I think CI on [CommonSettings: Migrate CentralNotice to Virtual Domains](https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1129229) because it does not actually change anything? [08:17:11] and for that job, success and failure are swapped. [08:18:36] we are still using codfw for deploying, right? 
[08:18:53] no idea, I'm sorry [08:19:22] I imagine there would be a motd saying so if we didn't [08:19:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:19:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/Echo] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:20:11] !log installing python-cryptography security updates [08:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:22] (03CR) 10Gergő Tisza: "(sorry just testing T389460)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [08:23:13] (03CR) 10CI reject: [V:04-1] Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:23:14] tgr_: I think the phan job on the -wmf.20 backport died. Not sure why [08:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656027 (10phaultfinder) [08:26:15] looks like some issue setting up the workspace? 
pretty sure that is unrelated to the actual change [08:26:58] (03PS2) 10Muehlenhoff: Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) [08:29:17] rsync: [sender] change_dir "/castor-mw-ext-and-skins/wmf-1.44.0-wmf.20/mwext-php74-phan" (in caches) failed: No such file or directory (2) [08:29:44] (03CR) 10Gergő Tisza: [C:03+2] Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:31:02] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [08:32:11] !log restart swift-proxy on ms-fe2010 T360913 [08:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:15] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [08:33:09] (03Merged) 10jenkins-bot: Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:33:11] (03Merged) 10jenkins-bot: Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [08:34:41] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129336|Add logging to help figure unserialization issues (T388725)]], [[gerrit:1129362|Add logging to help figure unserialization issues (T388725)]] [08:34:45] T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725 [08:36:34] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [08:37:28] (03CR) 10Ayounsi: [C:03+2] network/data.yaml: add sandbox1-b3-magru [puppet] - 10https://gerrit.wikimedia.org/r/1129219 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [08:40:21] (03PS1) 10Majavah: Revert "cloudgw/icmp check/ip6: disabling" [puppet] - 10https://gerrit.wikimedia.org/r/1129770 (https://phabricator.wikimedia.org/T388379) [08:40:37] !log merge/deploy network/data.yaml: add sandbox1-b3-magru [08:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:14] !log tgr@deploy2002 matmarex, tgr: Backport for [[gerrit:1129336|Add logging to help figure unserialization issues (T388725)]], [[gerrit:1129362|Add logging to help figure unserialization issues (T388725)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:41:17] T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725 [08:43:08] !log deploy pfw policy - T389456 [08:43:10] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:11] !log tgr@deploy2002 matmarex, tgr: Continuing with sync [08:43:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [08:44:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656067 (10phaultfinder) [08:45:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [08:46:29] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "cloudgw/icmp check/ip6: disabling" [puppet] - 10https://gerrit.wikimedia.org/r/1129770 (https://phabricator.wikimedia.org/T388379) (owner: 10Majavah) [08:47:42] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: move ml-serve2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1129762 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:48:00] (03CR) 10Majavah: [C:03+2] Revert "cloudgw/icmp check/ip6: disabling" [puppet] - 10https://gerrit.wikimedia.org/r/1129770 (https://phabricator.wikimedia.org/T388379) (owner: 10Majavah) [08:48:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [08:48:56] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [08:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656099 (10phaultfinder) [08:50:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1128225 (owner: 10Filippo Giunchedi) [08:50:46] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129336|Add logging to help figure unserialization issues (T388725)]], [[gerrit:1129362|Add logging to help figure unserialization issues (T388725)]] (duration: 16m 05s) [08:50:51] T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - 
https://phabricator.wikimedia.org/T388725 [08:51:02] (03PS2) 10Filippo Giunchedi: base: don't show diff for phaste config [puppet] - 10https://gerrit.wikimedia.org/r/1128225 [08:51:03] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5007 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129766 (owner: 10Muehlenhoff) [08:51:27] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] base: don't show diff for phaste config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128225 (owner: 10Filippo Giunchedi) [08:51:47] and I'm already seeing the new warnings rolling in. Thank you! [08:52:04] (03CR) 10Cathal Mooney: [C:03+1] Remove HE through SG.IX [homer/public] - 10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) (owner: 10Ayounsi) [08:52:38] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bookworm [08:53:03] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2002 [08:53:12] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [08:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:54:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:05] (03CR) 10Ayounsi: "Awesome, thanks!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [08:55:11] !log restart swift-proxy on ms-fe1010 T360913 [08:55:11] (03CR) 10Vgutierrez: [C:03+1] nginx: Remove prometheus.lua [puppet] - 10https://gerrit.wikimedia.org/r/1036672 (owner: 10Muehlenhoff) [08:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:15] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [08:57:54] jnuche: is it OK if I run over the window by ~20 min? [08:58:10] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2002 - elukey@cumin1002" [08:58:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2002 - elukey@cumin1002" [08:58:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:16] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2002.codfw.wmnet 43.16.192.10.in-addr.arpa 3.4.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:58:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2002.codfw.wmnet 43.16.192.10.in-addr.arpa 3.4.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:58:20] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2002 [08:58:20] sorry, trying to do too many things at once [08:58:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2002 [08:58:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2002 [08:58:39] (almost done setting up a test environment for ProofreadPage though) 
[08:58:40] tgr_: yeah, it's no problem, as you know the train is blocked atm anyway [08:58:58] thanks for working on that btw :) [08:59:19] the fix is easy. Getting to the point where I can test it apparently isn't [09:00:04] jnuche and jeena: MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0900). Please do the needful. [09:00:32] morning, as just mentioned ^, train blocked on T389430 [09:00:32] T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430 [09:00:47] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [09:01:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [09:01:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:01:45] (03CR) 10Elukey: [C:03+1] spicerack: convert some @property into methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [09:02:31] (03Merged) 10jenkins-bot: Enable SUL3 logins for 1% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129546 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [09:02:41] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [09:03:00] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129546|Enable SUL3 logins for 1% of group 1 users (T384153)]] [09:03:03] T384153: SUL3 Phase 3: All existing user 
login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [09:06:42] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:07:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [09:08:39] jnuche: so, I don't know anything about ProofreadPage, but locally the siteinfo API tells me Page and Index are content namespaces with the patch, and editing those namespaces works [09:09:02] (I didn't try to recreate the cross-extension conflict locally, but pretty sure this is the right way to fix it) [09:09:11] do we need to find a reviewer for that patch? [09:09:55] ("that patch" being https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/1129771 ) [09:10:21] (we don't, Tpt was faster) [09:10:50] (03CR) 10Vgutierrez: [C:04-2] "ats-tls is no longer in place in the CDN, HAProxy takes care of this" [puppet] - 10https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794) (owner: 10BryanDavis) [09:11:31] one sec, broke a bunch of tests [09:12:16] !log tgr@deploy2002 tgr: Backport for [[gerrit:1129546|Enable SUL3 logins for 1% of group 1 users (T384153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:12:20] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [09:12:35] tgr_: ack, do you think we should ping Lucas Werkmeister about taking a look at the patch? 
he seemed to have more context about the whole thing [09:14:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656174 (10phaultfinder) [09:14:42] (03PS1) 10Vgutierrez: hiera,haproxy: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794) [09:15:21] (03CR) 10Vgutierrez: [C:04-2] "please see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129774" [puppet] - 10https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794) (owner: 10BryanDavis) [09:15:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794) (owner: 10Vgutierrez) [09:15:35] (03CR) 10Ladsgroup: [C:03+1] Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) (owner: 10Muehlenhoff) [09:15:44] (03PS11) 10Ayounsi: netbox: refactor support for GraphQL queries [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [09:16:14] (03CR) 10Ayounsi: netbox: refactor support for GraphQL queries (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:16:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10656186 (10elukey) The reimage seems to fail after provisioning with UEFI, the partitioning step fails. This is the error that I see in /var/log/syslog: ` Mar 20... 
[09:18:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on db2165.codfw.wmnet with reason: Maintenance [09:18:52] (03PS1) 10Marostegui: Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1129775 [09:19:11] (03PS1) 10Elukey: installserver: fix preseed config for puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1129776 (https://phabricator.wikimedia.org/T381274) [09:19:18] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm [09:19:19] (03CR) 10Marostegui: [C:03+2] Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1129775 (owner: 10Marostegui) [09:21:21] !log tgr@deploy2002 tgr: Continuing with sync [09:23:32] (03PS1) 10Ladsgroup: Bump thumbnail steps to 30% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) [09:28:47] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129546|Enable SUL3 logins for 1% of group 1 users (T384153)]] (duration: 25m 47s) [09:28:51] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [09:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656221 (10phaultfinder) [09:29:39] fixed the tests [09:29:48] let's see if Tpt is still watching [09:30:20] can I quickly deploy a config patch in between? 
[09:30:49] I'm done with mine [09:31:08] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 30% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [09:31:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [09:31:47] thanks. I'll be done in a couple of minutes [09:31:59] (03Merged) 10jenkins-bot: Bump thumbnail steps to 30% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129778 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [09:32:26] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1129778|Bump thumbnail steps to 30% (T360589)]] [09:32:30] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [09:32:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [09:34:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [09:35:23] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti5004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129261 (owner: 10Muehlenhoff) [09:35:29] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1129778|Bump thumbnail steps to 30% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:37:27] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:37:40] (03CR) 10Ayounsi: [C:03+2] Remove HE through SG.IX [homer/public] - 
10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) (owner: 10Ayounsi) [09:38:16] (03Merged) 10jenkins-bot: Remove HE through SG.IX [homer/public] - 10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) (owner: 10Ayounsi) [09:38:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [09:39:21] (03CR) 10Tiziano Fogli: [C:03+1] logstash: read k8s-mw topics as needed [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [09:39:39] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656234 (10phaultfinder) [09:40:14] (03CR) 10Tiziano Fogli: [C:03+1] hieradata: move prometheus k8s instances off prometheus2006 [puppet] - 10https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [09:42:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1129776 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [09:44:53] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [09:46:22] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129778|Bump thumbnail steps to 30% (T360589)]] (duration: 13m 55s) [09:46:26] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [09:50:09] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2002.codfw.wmnet with OS bookworm [09:50:40] !log elukey@cumin1002 START - 
Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bookworm [09:50:58] (03CR) 10Elukey: [C:03+2] installserver: fix preseed config for puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1129776 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:03] (03PS16) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [09:52:06] (03CR) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [09:52:22] jnuche: I don't know what to do about the remaining CI error. It seems related to the patch but I can't reproduce it locally. We can either accept the CI break, force it through and test it in production (the test is a structure test for Special:Longpages so that would be straightforward), or wait until someone more familiar with Proofreadpage and/or namespace handling shows up. [09:54:34] tgr_: if we backport up to the mwdebug servers, could you run the test there before syncing out everywhere else? 
[09:55:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.602s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:55:48] running PHPUnit tests in production sounds scary [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:39] probably wouldn't work anyway, no composer dev dependencies etc. If it did work, I would be afraid of it making live DB or cache changes somehow. [09:56:49] If you mean test manually, sure I can do that [09:57:01] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti5004.eqsin.wmnet [09:57:08] tgr_: yeah, I meant the manual test for Special:Longpages you were proposing [09:58:06] yeah I can do that [09:58:19] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/1129771/comments/d26bba0b_8fbbfe91 [09:58:41] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium/82271/consoleFull [09:58:53] looks quite nasty, like some kind of loop when generating namespaces [09:58:57] That db query can ... 
cause issues, let's say [09:58:58] give me a min, I'm still trying to wrap my head around the CI errors [09:59:08] but I reviewed the ProofreadPage code and it's definitely loop-free [09:59:30] the tests pass locally, and NamespaceInfo is used in all kinds of places, not just those two special pages [09:59:36] so no clue what's going on there [09:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656273 (10phaultfinder) [09:59:40] maybe something is wrong with SpecialLongPages? [10:00:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.253s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:00:58] maybe [10:01:05] the code seems pretty normal: https://gerrit.wikimedia.org/g/mediawiki/core/+/c24c8735d78abf33a8ed475c88379ac7588ce213/includes/specials/SpecialShortPages.php#60 [10:01:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:01:50] ok, yeah, that doesn't look good. I'd feel better if we can have someone else to take a look before merging [10:04:09] reverse-engineering from that test error, it seems like NamespaceInfo::getContentNamespaces() repeats an infinite number of times [10:04:34] ...repeats the ProofreadPage namespaces an infinite number of times [10:04:39] but only in that one test [10:07:01] there must be some sort of loop that causes the MediaWikiServices hook to be called infinite times [10:07:44] hm, do we isolate globals between tests? 
[10:08:26] maybe it's just a matter of ProofreadPage manipulating globals directly, and then every time a ProofreadPage testcase runs, it adds more namespaces [10:09:32] !log installing gunicorn security updates [10:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:49] tgr_: yes possibly there's a $wgContentNamespaces[] = $wgProofreadPageNamespaceIds[$key] [10:13:07] (03PS1) 10Muehlenhoff: klaxon: Drop conditional for buster [puppet] - 10https://gerrit.wikimedia.org/r/1129784 [10:13:24] (03PS2) 10Muehlenhoff: klaxon: Drop conditional for buster [puppet] - 10https://gerrit.wikimedia.org/r/1129784 [10:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656302 (10phaultfinder) [10:18:50] (03CR) 10Muehlenhoff: [C:03+2] Initial stub role for mariadb::research [puppet] - 10https://gerrit.wikimedia.org/r/1129764 (https://phabricator.wikimedia.org/T389089) (owner: 10Muehlenhoff) [10:21:17] (03CR) 10Muehlenhoff: [C:03+2] Bitu: Add approval config for airflow-research-ops [puppet] - 10https://gerrit.wikimedia.org/r/1128357 (owner: 10Muehlenhoff) [10:21:42] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2002.codfw.wmnet with OS bookworm [10:22:25] (03CR) 10Vgutierrez: [C:04-1] haproxy: using tmpfs directory for private tls material (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:24:13] (03CR) 10Filippo Giunchedi: [C:03+1] klaxon: Drop conditional for buster [puppet] - 10https://gerrit.wikimedia.org/r/1129784 (owner: 10Muehlenhoff) [10:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656342 (10phaultfinder) [10:25:09] (03PS2) 10Filippo Giunchedi: misc: report search-grafana-dashboards results details in markdown [software] - 10https://gerrit.wikimedia.org/r/1129242 [10:25:33] 10ops-codfw, 06DC-Ops, 
06Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472 (10elukey) 03NEW [10:26:03] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10656369 (10elukey) The host is completely depooled, please take any action that you need to do :) [10:26:56] (03PS1) 10Vgutierrez: acme_chief::cloud: Avoid leaking designate secrets [puppet] - 10https://gerrit.wikimedia.org/r/1129786 [10:28:26] (03CR) 10MVernon: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1129377 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [10:31:13] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [10:33:04] (03CR) 10Muehlenhoff: [C:03+2] klaxon: Drop conditional for buster [puppet] - 10https://gerrit.wikimedia.org/r/1129784 (owner: 10Muehlenhoff) [10:34:13] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1129786 (owner: 10Vgutierrez) [10:34:29] (03CR) 10Vgutierrez: [C:03+2] acme_chief::cloud: Avoid leaking designate secrets [puppet] - 10https://gerrit.wikimedia.org/r/1129786 (owner: 10Vgutierrez) [10:38:41] !log restart imposm.service on maps1009 - T389462 [10:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:45] T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462 [10:39:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656421 (10phaultfinder) [10:42:58] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm [10:43:14] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [10:44:23] !log installing Java security updates on idp hosts [10:44:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:43] (03PS6) 10Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - 10https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) [10:44:43] (03PS2) 10Ayounsi: Add transit/peering in/out port saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) [10:47:12] (03CR) 10Ayounsi: "Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:51:03] (03PS2) 10Phuedx: ext-EventStreamConfig: Reduce product_metrics.web_base data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 [10:55:36] jnuche: we are good to go, but out of time I guess? [10:55:59] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'clear' for AS: 52999 [10:56:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 52999 [10:58:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs7003.magru.wmnet,lvs1013.eqiad.wmnet} and A:liberica [10:58:17] (03PS17) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [10:58:31] (03CR) 10Fabfur: haproxy: using tmpfs directory for private tls material (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:59:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs7003.magru.wmnet,lvs1013.eqiad.wmnet} and A:liberica [10:59:07] jouncebot: nowandnext [10:59:07] For the next 0 hour(s) and 0 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T0900) [10:59:07] In 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1100) [10:59:25] going to ask if we can squeeze in the backport [10:59:38] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs[7001-7002].magru.wmnet} and A:liberica [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1100) [11:00:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:57] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs[7001-7002].magru.wmnet} and A:liberica [11:04:02] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [11:05:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:59] (03PS5) 10Slyngshede: Upgrade CAS to version 7.1.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 [11:08:15] (03CR) 10Filippo Giunchedi: [C:03+1] Duplicate LibreNMS In/out interface errors [alerts] - 10https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:10:52] (03CR) 10Vgutierrez: [C:04-1] "current CR breaks OCSP response stapling for certificates deployed by sslcert::certificate" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [11:13:37] (03PS1) 10Muehlenhoff: Failover idp to idp1004 [dns] - 
10https://gerrit.wikimedia.org/r/1129788 [11:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656580 (10phaultfinder) [11:15:52] tgr_: if you're still around, I think we can go ahead with backporting the fix [11:17:49] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm [11:18:00] yay! [11:18:29] zuul on master seems hopelessly backlogged but the normal tests pass so I think that's good enough [11:18:56] (03PS1) 10Gergő Tisza: Use MediaWikiServices for early config changes [extensions/ProofreadPage] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129789 (https://phabricator.wikimedia.org/T288819) [11:19:14] looks like the gate jobs finally made it through [11:19:32] (as in, started running) [11:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656606 (10phaultfinder) [11:20:17] are you backporting, or should I? [11:22:35] tgr_: can you do the honors? 
:) [11:22:45] tgr_: wait [11:23:08] (03PS1) 10Muehlenhoff: Add db1300 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1129790 (https://phabricator.wikimedia.org/T389089) [11:23:36] tgr_: nvm, I thought SRE may have an issue with backporting now [11:23:39] seems it's ok [11:26:13] (03CR) 10Muehlenhoff: [C:03+2] Add db1300 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1129790 (https://phabricator.wikimedia.org/T389089) (owner: 10Muehlenhoff) [11:26:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129789 (https://phabricator.wikimedia.org/T288819) (owner: 10Gergő Tisza) [11:31:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host db1300.eqiad.wmnet [11:31:13] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:32:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [11:33:08] (03PS1) 10Brouberol: Query the wiki API through envoy when running in kubernetes [dumps] - 10https://gerrit.wikimedia.org/r/1129793 (https://phabricator.wikimedia.org/T388378) [11:35:17] (03PS18) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [11:36:18] (03CR) 10Fabfur: haproxy: using tmpfs directory for private tls material (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [11:37:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db1300.eqiad.wmnet - jmm@cumin2002" [11:37:32] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db1300.eqiad.wmnet - jmm@cumin2002" [11:37:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:37:32] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache db1300.eqiad.wmnet on all recursors [11:37:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db1300.eqiad.wmnet on all recursors [11:37:38] !log installing debootstrap bugfix updates [11:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [11:38:05] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db1300.eqiad.wmnet - jmm@cumin2002" [11:38:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db1300.eqiad.wmnet - jmm@cumin2002" [11:39:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74277 and previous config saved to /var/cache/conftool/dbconfig/20250320-113918-root.json [11:40:27] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10656640 (10MoritzMuehlenhoff) [11:40:44] (03Merged) 10jenkins-bot: Use MediaWikiServices for early config changes [extensions/ProofreadPage] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129789 (https://phabricator.wikimedia.org/T288819) (owner: 10Gergő Tisza) [11:41:16] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129789|Use MediaWikiServices for early
config changes (T288819 T389430)]] [11:41:20] T288819: NamespaceInfo service missing namespaces if initialized too early - https://phabricator.wikimedia.org/T288819 [11:41:20] T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430 [11:42:13] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [11:42:43] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, one comment in line." [alerts] - 10https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:42:55] dcausse: should I test something specific for the ProofreadPage patch, other than namespace info in the siteinfo API? [11:44:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656659 (10phaultfinder) [11:45:10] (03PS1) 10Zoe: Re-enable creation of Flow pages for sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129795 [11:45:38] (03CR) 10Btullis: [C:03+1] Query the wiki API through envoy when running in kubernetes [dumps] - 10https://gerrit.wikimedia.org/r/1129793 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:46:51] (03CR) 10Cathal Mooney: "Nice! Overall it lgtm if everyone is in agreement." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [11:47:11] (03PS2) 10Zoe: Re-enable creation of Flow pages for sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) [11:47:40] (03CR) 10Cathal Mooney: [C:03+1] Add transit/peering in/out port saturation alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [11:48:12] !log tgr@deploy2002 tgr: Backport for [[gerrit:1129789|Use MediaWikiServices for early config changes (T288819 T389430)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:48:17] T288819: NamespaceInfo service missing namespaces if initialized too early - https://phabricator.wikimedia.org/T288819 [11:48:17] T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430 [11:51:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host db1300.eqiad.wmnet with OS bookworm [11:53:44] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [11:54:13] jnuche: I'm trying to find a wiki where I can test the fix; https://versions.toolforge.org/ says group 0 is on wmf.21, but https://test2.wikipedia.org/wiki/Special:Version says it's on wmf.20 [11:54:20] I guess the version tool is wrong? 
[11:54:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74278 and previous config saved to /var/cache/conftool/dbconfig/20250320-115423-root.json [11:55:14] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10656702 (10Ladsgroup) [11:55:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656703 (10phaultfinder) [11:56:01] tgr_: `test2` is actually group1, but `test` should have the fix if you can test there: https://test.wikipedia.org/wiki/Special:Version [11:56:35] it doesn't use ProofreadPage though [11:56:52] I guess not really testable then [11:57:08] I can test during train rollout if that's OK [11:57:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [11:57:23] jouncebot: nowandnext [11:57:24] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1100) [11:57:24] In 0 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1200) [11:57:34] tgr_: yeah, let's do that [11:57:53] ok, I'm going to roll out the train to group2 in a few minutes [11:57:55] or I guess closed wikisources would be group 0 [11:58:28] I see a couple of wikisource wikis in group0, yes [11:58:35] you want me to hold on?
[11:59:04] (03CR) 10Brouberol: [C:03+2] Query the wiki API through envoy when running in kubernetes [dumps] - 10https://gerrit.wikimedia.org/r/1129793 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:59:49] eh [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1200) [12:00:28] I'm trying ht.wikisource which is definitely wmf.21, but even without the fix the siteinfo API says all the ProofreadPage namespaces are content [12:00:42] so maybe the bug only occurs in a more specific situation? [12:00:46] let me finish the backport [12:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [12:01:10] or will the train scap take care of that anyway? [12:02:21] tgr_: yeah, if you finish the backport you will get the fix in ht.wikisource [12:02:32] sorry, I didn't realize you had stopped at testservers sync [12:02:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1300.eqiad.wmnet with reason: host reimage [12:02:45] it should be safe to finish, you should go ahead [12:03:20] I mean the fix is now on the test servers but absolutely no difference in behavior with or without, I can't reproduce the bug [12:03:26] !log tgr@deploy2002 tgr: Continuing with sync [12:03:56] I also can't test Special:Longpages because in production that's a daily job (although I'm very confident that was just a cross-test pollution issue) [12:04:32] tgr_: sry, missed that, doing couple things at the same time [12:04:59] if the problem still persists, do you have an idea how bad the impact could be if we roll all the way to group2? 
[12:05:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1300.eqiad.wmnet with reason: host reimage [12:05:24] *in case the problem still persists after your fix [12:05:50] (03PS1) 10Clément Goubert: modules.cache.mcrouter: Copy for new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129802 (https://phabricator.wikimedia.org/T389480) [12:05:57] (03PS1) 10Clément Goubert: modules.cache.mcrouter: Allow exporter port config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129803 (https://phabricator.wikimedia.org/T389480) [12:06:12] (03PS1) 10Clément Goubert: mcrouter: Update cache.mcrouter to 1.3.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129804 (https://phabricator.wikimedia.org/T389480) [12:06:16] it would break search, no idea about timing (how much time until it affects the ES index / how much time until a revert affects the index) [12:06:41] (03PS1) 10Aklapper: phabricator weekly changes email: List tasks "in progress" for >2y [puppet] - 10https://gerrit.wikimedia.org/r/1129806 (https://phabricator.wikimedia.org/T380300) [12:07:12] FWIW we had the exact same issue with Wikibase a week ago and the same fix worked there [12:08:31] EBernhardson added a few notes here on how they debugged the issue: https://phabricator.wikimedia.org/T389430#10654683 [12:08:56] would it be possible to do the same thing in mwdebug1002 right now and verify we get a trace similar to wmf.20? 
[12:09:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74279 and previous config saved to /var/cache/conftool/dbconfig/20250320-120928-root.json [12:10:51] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129789|Use MediaWikiServices for early config changes (T288819 T389430)]] (duration: 29m 34s) [12:10:56] T288819: NamespaceInfo service missing namespaces if initialized too early - https://phabricator.wikimedia.org/T288819 [12:10:56] T389430: Page and Index namespaces from ProofreadPage extension no longer considered content namespaces since deploy of 1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389430 [12:11:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:11:24] similar to wmf.20, or whatever load trace is expected after the changes [12:12:26] I do see the 250/252 namespaces [12:13:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm [12:14:14] but then I'm pretty sure those commands are identical to looking at the siteinfo API [12:14:23] tgr_: that sounds promising, how about I roll out to group1 and then we check in one of the wikis with ProofreadPage there? 
[12:14:41] so for some reason the bug doesn't seem reproducible on the group0 wikisources in the first place [12:14:50] yeah, let's do that [12:14:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10656759 (10elukey) Host up and running with UEFI and Bookworm :) [12:15:01] all aboard the train [12:15:17] (03CR) 10Muehlenhoff: [C:03+2] Failover idp to idp1004 [dns] - 10https://gerrit.wikimedia.org/r/1129788 (owner: 10Muehlenhoff) [12:15:29] !log jmm@dns1004 START - running authdns-update [12:15:33] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129809 (https://phabricator.wikimedia.org/T386216) [12:15:34] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129809 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [12:16:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:16:26] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129809 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [12:17:38] !log jmm@dns1004 END - running authdns-update [12:18:15] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1300.eqiad.wmnet with OS bookworm [12:21:55] !log jmm@cumin2002 
END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db1300.eqiad.wmnet [12:23:15] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74280 and previous config saved to /var/cache/conftool/dbconfig/20250320-122433-root.json [12:28:39] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.21 refs T386216 [12:28:43] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [12:30:07] tgr_: we're at group1 [12:32:24] I re-did ebernhardson's tests on the same wiki he used and it looks correct (the namespace IDs include 100/102 for all three commands) [12:33:55] 🎉 [12:34:27] awesome, going to wait a couple of minutes and then I'll continue deploying to group2 [12:34:54] tgr_: thanks for the fix and following up on this [12:36:23] !log installing openjdk 17 security updates on puppet servers (the necessary restarts may cause a few interrupted puppet runs and will be splayed out) [12:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74281 and previous config saved to /var/cache/conftool/dbconfig/20250320-123939-root.json [12:43:01] (03PS1) 10Effie Mouzeli: mw-mcrouter: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129816 [12:44:52] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10656864 
(10phaultfinder) [12:45:00] (03Abandoned) 10Kosta Harlan: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126980 (https://phabricator.wikimedia.org/T388125) (owner: 10Máté Szabó) [12:45:27] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129817 (https://phabricator.wikimedia.org/T386216) [12:45:31] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129817 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [12:45:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:48:16] ^ the drmrs failures are transient, caused by the Java update on puppet servers [12:49:56] (03CR) 10Jaime Nuche: [V:03+2] group2 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129817 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [12:50:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/main (k8s) 1.378s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:51:18] (03CR) 10Effie Mouzeli: "I think setting monitoring.named_ports:true will tidy things up" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129803 (https://phabricator.wikimedia.org/T389480) (owner: 10Clément Goubert) [12:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:53:52] (03PS1) 10Sergio Gimeno: analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) [12:54:20] (03PS1) 10Sergio Gimeno: feat(SurfacingStructuredTasks): increase max edit cap to 100 [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) [12:54:50] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:55:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:55:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno) [12:55:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/main (k8s) 1.378s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:55:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for 
draining ganeti node ganeti-test2001.codfw.wmnet [12:55:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno) [12:56:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2002.codfw.wmnet to drbd [13:00:07] (03PS1) 10Gergő Tisza: Enable SUL3 login for 10% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) [13:01:13] jouncebot: now [13:01:13] For the next 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1300) [13:02:12] looks like the bot stopped announcing windows [13:02:18] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.21 refs T386216 [13:02:22] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [13:02:50] yep, maybe dst confusion? 
[13:02:53] o/ [13:03:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of testvm2002.codfw.wmnet to drbd [13:03:38] I'm gonna self-deploy my changes [13:03:50] please hold [13:04:05] train just finished deploying, I need to check logs [13:04:13] sergi0: ^ [13:04:26] ack [13:04:58] (03PS2) 10Filippo Giunchedi: logstash: read k8s-mw topics as needed [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) [13:05:57] (03PS1) 10Effie Mouzeli: thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) [13:05:57] (03CR) 10Filippo Giunchedi: [C:03+1] mcrouter: Update cache.mcrouter to 1.3.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129804 (https://phabricator.wikimedia.org/T389480) (owner: 10Clément Goubert) [13:06:37] I might add a config patch in a while [13:10:09] sergi0: thanks for waiting, you can go ahead with backports [13:10:22] great, ty! 
[13:10:34] (03PS2) 10Effie Mouzeli: thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) [13:10:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno) [13:10:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno) [13:11:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [13:11:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [13:12:59] (03Merged) 10jenkins-bot: analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129819 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno) [13:14:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [13:17:50] (03Abandoned) 10Effie Mouzeli: common.yaml: remove firewall rules for kafka-main100[1-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100807 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [13:19:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657024 (10phaultfinder) [13:20:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure 
[13:25:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657055 (10phaultfinder) [13:27:54] (03Merged) 10jenkins-bot: feat(SurfacingStructuredTasks): increase max edit cap to 100 [extensions/GrowthExperiments] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129820 (https://phabricator.wikimedia.org/T388622) (owner: 10Sergio Gimeno) [13:28:14] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1129819|analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data (T388622)]], [[gerrit:1129820|feat(SurfacingStructuredTasks): increase max edit cap to 100 (T388622)]] [13:28:17] T388622: Increase target audience for Surfacing Structured Task Experiment - https://phabricator.wikimedia.org/T388622 [13:29:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [13:30:36] !log remove ganeti-test2001 for reimage T382515 [13:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] T382515: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515 [13:31:06] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1129819|analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data (T388622)]], [[gerrit:1129820|feat(SurfacingStructuredTasks): increase max edit cap to 100 (T388622)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:01] !log sgimeno@deploy2002 sgimeno: Continuing with sync [13:34:28] (03PS1) 10Slyngshede: P:mirrors add file age exporter [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) [13:35:07] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129828 [13:35:23] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129829 
[13:35:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [13:36:33] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129830 [13:36:43] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129831 [13:38:31] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5118/co" [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:39:14] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129819|analytics(GrowthExperimentsInteractionLogger): add edit_count to the event data (T388622)]], [[gerrit:1129820|feat(SurfacingStructuredTasks): increase max edit cap to 100 (T388622)]] (duration: 11m 00s) [13:39:18] T388622: Increase target audience for Surfacing Structured Task Experiment - https://phabricator.wikimedia.org/T388622 [13:40:30] I'm done with my changes, tgr_ you want to take yours or I can do it if you want [13:41:19] sergi0: sure, thanks [13:42:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [13:42:29] it's not really testable [13:44:05] (03CR) 10Muehlenhoff: [C:03+2] Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 (owner: 10Muehlenhoff) [13:44:08] ok, just curious, what signal do you normally look at after a SUL3 rollout? 
[13:44:24] (03Merged) 10jenkins-bot: Enable SUL3 login for 10% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129821 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [13:44:43] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1129821|Enable SUL3 login for 10% of group 1 users (T384153)]] [13:44:47] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [13:45:50] (03PS1) 10Muehlenhoff: Switch ganeti-test2001 to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1129832 (https://phabricator.wikimedia.org/T382515) [13:47:47] !log sgimeno@deploy2002 tgr, sgimeno: Backport for [[gerrit:1129821|Enable SUL3 login for 10% of group 1 users (T384153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:48:03] tgr_: should I proceed with sync then? [13:48:13] yes, thanks [13:48:19] !log sgimeno@deploy2002 tgr, sgimeno: Continuing with sync [13:48:32] I'll look at error logs in a few hours [13:48:49] 👍 [13:49:18] plus we have some statsd charts about authentication action frequencies and error rates [13:49:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657160 (10phaultfinder) [13:50:03] admittedly not terribly useful because there are so many weird scrapers which almost but not quite simulate human browsing behavior, it's mostly noise [13:50:13] (03PS2) 10Slyngshede: P:mirrors add file age exporter [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) [13:50:24] so in practice it's mostly just error logs and human error reports [13:51:05] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5119/co" [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) 
[13:51:47] (03CR) 10Vgutierrez: [C:04-1] "please add an additional check that ensure that no cert is being configured to use the on-disk paths if the volatile TLS storage is enable" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:52:35] gotcha, thanks for explaining [13:53:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:55:12] (03CR) 10Arturo Borrero Gonzalez: Add new profile (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1129595 (owner: 10Chuckonwumelu) [13:56:02] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129821|Enable SUL3 login for 10% of group 1 users (T384153)]] (duration: 11m 18s) [13:56:06] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [13:56:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:56:50] tgr_: your change is live [13:57:22] (03CR) 10Xcollazo: [C:03+1] "(Post merge +1, for completeness, and as per Slack conversations.)" [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [13:57:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:57:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:58:11] thanks! 
[13:58:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:59:59] (03PS3) 10DCausse: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) [13:59:59] (03PS3) 10DCausse: cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) [13:59:59] (03PS3) 10DCausse: cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) [14:00:02] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash: pybal logs into logstash - https://phabricator.wikimedia.org/T223924#10657173 (10fgiunchedi) 05Open→03Declined pybal is being replaced by liberica [14:00:04] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10657175 (10Jhancock.wm) [14:01:00] jouncebot: nowandnext [14:01:00] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [14:01:01] In 0 hour(s) and 58 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500) [14:01:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10657185 (10Jhancock.wm) absolutely agree after the all the work I see y'all doing. I've pulled a random disk and reinserted. l... 
[14:01:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10657186 (10Jhancock.wm) a:03Jhancock.wm [14:02:42] (03PS1) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) [14:03:12] (03PS1) 10Dreamy Jazz: GlobalContributionsPagerTest: De-duplicate getting new pager [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 [14:03:51] (03PS2) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) [14:04:03] (03CR) 10Dreamy Jazz: [C:03+2] GlobalContributionsPagerTest: De-duplicate getting new pager [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 (owner: 10Dreamy Jazz) [14:04:06] (03CR) 10Dreamy Jazz: [C:03+2] GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz) [14:06:42] (03PS1) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) [14:08:15] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [14:08:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.542s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:08:39] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: upgrade search plugins - bking@cumin2002 - T389119 [14:08:43] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [14:08:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:09:52] Going to deploy some wmf backports [14:10:09] (03PS2) 10Dreamy Jazz: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) [14:10:16] (03Merged) 10jenkins-bot: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [14:10:29] (03CR) 10Dreamy Jazz: [C:03+2] GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz) [14:11:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: upgrade search plugins - bking@cumin2002 - T389119 [14:11:33] 14SRE-grizzly-sprint, 10Observability-Metrics: Grizzly: upgrade to 0.2 - https://phabricator.wikimedia.org/T332892#10657217 
(10fgiunchedi) 05Open→03Invalid We have replaced Grizzly with Pyrra [14:12:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [14:12:20] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:12:37] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:13:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.281s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:13:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz) [14:13:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 (owner: 10Dreamy Jazz) [14:13:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz) [14:21:44] (03PS1) 10Gergő Tisza: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) [14:22:47] (03CR) 10CI reject: [V:04-1] Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 
(https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza) [14:23:14] (03PS1) 10Vgutierrez: sre: Add LibericaStaleConfig alert [alerts] - 10https://gerrit.wikimedia.org/r/1129846 (https://phabricator.wikimedia.org/T389175) [14:23:25] (03Merged) 10jenkins-bot: GlobalContributionsPagerTest: De-duplicate getting new pager [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129838 (owner: 10Dreamy Jazz) [14:23:26] (03Merged) 10jenkins-bot: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129837 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz) [14:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657242 (10phaultfinder) [14:27:53] (03CR) 10Bking: "Per Slack conversation with @aotto@wikimedia.org, DPE should not be affected. CCing our Search Platform SWEs for review" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [14:32:27] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:39] (03CR) 10Ahmon Dancy: "Confirmed. In T383947 new groups "spiderpig-users" and "spiderpig-admins" are proposed (although the latter is probably not needed)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [14:33:42] (03Merged) 10jenkins-bot: GlobalContributions: Do not look up permissions for registered target [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129839 (https://phabricator.wikimedia.org/T389187) (owner: 10Dreamy Jazz) [14:34:02] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1129837|GlobalContributions: Do not look up permissions for registered target (T389187)]], [[gerrit:1129838|GlobalContributionsPagerTest: De-duplicate getting new pager]], [[gerrit:1129839|GlobalContributions: Do not look up permissions for registered target (T389187)]] [14:34:06] T389187: GlobalContributions: Make displaying deleted revisions optional - https://phabricator.wikimedia.org/T389187 [14:35:06] (03CR) 10Bking: "Upon further review, Search Platform SWEs do not believe we are affected by this change. Feel free to move forward." 
[puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [14:35:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [14:35:57] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [14:36:27] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:38:50] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1129837|GlobalContributions: Do not look up permissions for registered target (T389187)]], [[gerrit:1129838|GlobalContributionsPagerTest: De-duplicate getting new pager]], [[gerrit:1129839|GlobalContributions: Do not look up permissions for registered target (T389187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:40:08] (03CR) 10Ssingh: [C:03+1] "Thanks for the runbook link!" [alerts] - 10https://gerrit.wikimedia.org/r/1129846 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [14:41:29] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:42:01] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:42:47] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM! Thank you for sharing some more of the computing resources :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:43:59] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM! Thank you." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:44:53] (03CR) 10Brouberol: [C:03+2] airflow-main: increase the scheduler resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:44:56] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:46:23] (03Merged) 10jenkins-bot: airflow-main: increase the scheduler resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:46:27] (03Merged) 10jenkins-bot: airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:48:09] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:49:07] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129837|GlobalContributions: Do not look up permissions for registered target (T389187)]], [[gerrit:1129838|GlobalContributionsPagerTest: De-duplicate getting new pager]], [[gerrit:1129839|GlobalContributions: Do not look up permissions for registered target (T389187)]] (duration: 15m 04s) [14:49:11] T389187: GlobalContributions: Make displaying deleted revisions optional - https://phabricator.wikimedia.org/T389187 [14:49:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:49:56] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:50:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:50:08] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:51:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:51:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:52:24] (03CR) 10Vgutierrez: [C:03+2] sre: Add LibericaStaleConfig alert [alerts] - 10https://gerrit.wikimedia.org/r/1129846 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [14:52:29] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:52:30] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:52:32] jouncebot: nowandnext [14:52:33] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [14:52:33] In 0 hour(s) and 7 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500) [14:52:44] !log 
dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:52:57] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:53:19] (03PS1) 10Brouberol: Fix typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129850 (https://phabricator.wikimedia.org/T386282) [14:54:22] (03CR) 10Filippo Giunchedi: "Can't vote with confidence, sorry!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [14:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657316 (10phaultfinder) [14:57:20] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: try rolling operation without allow-yellow flag - bking@cumin2002 - T389119 [14:57:25] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [14:57:50] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:58:01] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:58:40] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) [14:59:12] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 00m 33s) [14:59:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:59:53] (03CR) 10Brouberol: [C:03+2] Fix typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129850 (https://phabricator.wikimedia.org/T386282) 
(owner: 10Brouberol) [15:00:05] jnuche and jeena: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500) [15:00:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:00:38] (03CR) 10Elukey: [C:03+1] Switch ganeti-test2001 to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1129832 (https://phabricator.wikimedia.org/T382515) (owner: 10Muehlenhoff) [15:01:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:01:42] (03PS6) 10BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) [15:01:52] (03CR) 10BCornwall: cdn: Add roll-upgrade-varnish (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [15:02:42] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) [15:03:28] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 00m 51s) [15:03:29] not sure who added storcli but note: https://puppetboard.wikimedia.org/failures [15:03:40] E: Problem with MergeList /var/lib/apt/lists/apt.wikimedia.org_wikimedia_dists_bookworm-wikimedia_thirdparty_hwraid_binary-amd64_Packages [15:03:43] E: The package lists or status file could not be parsed or opened. [15:03:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10657360 (10Jhancock.wm) okay since this has happened before i pulled DIMM_B1 to see if it would boot without it. Got the same error on DIMM_B2. moved it to DIMM_B1. error move... 
[15:03:49] this is causing a widespread puppet failure [15:04:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10657361 (10Jhancock.wm) a:03Jhancock.wm [15:04:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657363 (10phaultfinder) [15:04:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:06:26] ^ should recover soon. I rolled back, this was caused by https://phabricator.wikimedia.org/T388628#10657364 [15:06:28] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:06:36] ah thank you [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:46] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:11:31] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:11:32] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:11:39] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:11:49] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:14:42] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: try rolling operation without allow-yellow flag - bking@cumin2002 - T389119 [15:14:47] 
T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [15:19:10] (03PS1) 10BCornwall: upgrade cp3067 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129854 (https://phabricator.wikimedia.org/T378737) [15:19:11] (03PS1) 10BCornwall: upgrade cp3068 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129855 (https://phabricator.wikimedia.org/T378737) [15:19:13] (03PS1) 10BCornwall: upgrade cp3069 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129856 (https://phabricator.wikimedia.org/T378737) [15:19:14] (03PS1) 10BCornwall: upgrade cp3070 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129857 (https://phabricator.wikimedia.org/T378737) [15:19:16] (03PS1) 10BCornwall: upgrade cp3071 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129858 (https://phabricator.wikimedia.org/T378737) [15:19:17] (03PS1) 10BCornwall: upgrade cp3072 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129859 (https://phabricator.wikimedia.org/T378737) [15:19:19] (03PS1) 10BCornwall: upgrade cp3073 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129860 (https://phabricator.wikimedia.org/T378737) [15:19:23] (03PS1) 10BCornwall: upgrade cp3075 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129861 (https://phabricator.wikimedia.org/T378737) [15:19:27] (03PS1) 10BCornwall: upgrade cp3076 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129862 (https://phabricator.wikimedia.org/T378737) [15:19:31] (03PS1) 10BCornwall: upgrade cp3077 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129863 (https://phabricator.wikimedia.org/T378737) [15:19:35] (03PS1) 10BCornwall: upgrade cp3078 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129864 (https://phabricator.wikimedia.org/T378737) [15:19:39] (03PS1) 10BCornwall: upgrade cp3079 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129865 
(https://phabricator.wikimedia.org/T378737) [15:19:43] (03PS1) 10BCornwall: upgrade cp3080 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129866 (https://phabricator.wikimedia.org/T378737) [15:19:47] (03PS1) 10BCornwall: upgrade cp3081 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129867 (https://phabricator.wikimedia.org/T378737) [15:20:12] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudelastic[1007,1009-1012].eqiad.wmnet with reason: troubleshooting red status [15:20:52] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti-test2001 to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1129832 (https://phabricator.wikimedia.org/T382515) (owner: 10Muehlenhoff) [15:21:38] (03PS1) 10Andrew Bogott: typos: add 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1129868 [15:25:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657504 (10phaultfinder) [15:27:18] (03CR) 10Ssingh: [C:03+1] upgrade cp3067 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129854 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:27:20] (03CR) 10Ssingh: [C:03+1] upgrade cp3068 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129855 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:27:27] (03CR) 10Ssingh: [C:03+1] upgrade cp3069 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129856 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:27:36] (03CR) 10Ssingh: [C:03+1] upgrade cp3070 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129857 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:27:40] (03CR) 10Ssingh: [C:03+1] upgrade cp3071 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129858 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:27:45] (03CR) 10Ssingh: [C:03+1] upgrade cp3072 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1129859 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:27:53] (03CR) 10Ssingh: [C:03+1] upgrade cp3073 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129860 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:01] (03CR) 10Ssingh: [C:03+1] upgrade cp3075 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129861 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:10] (03CR) 10Ssingh: [C:03+1] upgrade cp3076 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129862 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:12] (03CR) 10Ssingh: [C:03+1] upgrade cp3077 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129863 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:22] (03CR) 10Ssingh: [C:03+1] upgrade cp3078 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129864 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:25] (03CR) 10Ssingh: [C:03+1] upgrade cp3079 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129865 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:30] (03CR) 10Ssingh: [C:03+1] upgrade cp3080 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129866 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:28:43] (03CR) 10Ssingh: [C:03+1] upgrade cp3081 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129867 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:29:17] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:29:39] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:32:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [15:32:28] jouncebot: nowandnext [15:32:28] For the next 0 hour(s) and 27 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1500) [15:32:28] In 0 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1600) [15:34:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:36:08] !log cgoubert@deploy2002 Started scap sync-world: Build mediawiki-cli image - T389484 [15:36:14] T389484: Create a mediawiki-cli image - https://phabricator.wikimedia.org/T389484 [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:25] (03CR) 10Eevans: [C:03+2] restbase: commission restbase1043 (refresh for restbase1028) [puppet] - 10https://gerrit.wikimedia.org/r/1129377 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [15:38:46] (03CR) 10BCornwall: [C:03+2] upgrade 
cp3067 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129854 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:39:25] (03CR) 10Ssingh: [C:03+1] cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [15:39:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:42:27] !log cgoubert@deploy2002 Finished scap sync-world: Build mediawiki-cli image - T389484 (duration: 06m 18s) [15:42:31] T389484: Create a mediawiki-cli image - https://phabricator.wikimedia.org/T389484 [15:45:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657573 (10phaultfinder) [15:47:35] !log installing node-postcss security updates [15:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2248.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:48:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2248.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:49:24] (03CR) 10Pppery: "This doesn't seem like the correct analysis of the cause - the maintenance script runs as "flow talk page manager", which should already h" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [15:52:27] FIRING: [6x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:56] (03PS1) 10Bking: relforge: move relforge1003 into OpenSearch role [puppet] 
- 10https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752) [15:54:18] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:54:31] (03PS2) 10Bking: relforge: move relforge1003 into OpenSearch role [puppet] - 10https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752) [15:55:55] (03PS1) 10Muehlenhoff: testreduce: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) [15:56:43] (03CR) 10BCornwall: [C:03+2] cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [15:57:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [15:57:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [15:58:46] (03CR) 10Clément Goubert: [C:03+1] thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [15:58:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2250 to codfw - jhancock@cumin2002" [15:58:50] (03CR) 10Clément Goubert: [C:03+1] mw-mcrouter: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129816 (owner: 10Effie Mouzeli) [16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1600). [16:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2250 to codfw - jhancock@cumin2002" [16:00:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:23] (03CR) 10DLynch: "The errors we got from running the script were clearly saying that the flow-create-board permission was missing, though. It could certainl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [16:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [16:01:48] o/ [16:02:01] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2250 [16:02:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2250 [16:02:14] sorry i accidentally grafana, one sec [16:03:02] back [16:03:52] (03CR) 10Clément Goubert: [C:03+1] mw-on-k8s: align alerts with "pools" of capacity [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: 10Scott French) [16:06:14] (03CR) 10Vgutierrez: cdn: Add roll-upgrade-varnish (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [16:07:12] !log stop imposm on maps1009 to allow fixing the postgres db - T389462 [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:16] T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462 [16:07:27] (03CR) 10Clément Goubert: [C:03+1] 
testreduce: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [16:08:22] (03CR) 10Filippo Giunchedi: "reverse-proxying https with mod_proxy is possible, a change similar to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125990 is nee" [puppet] - 10https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: 10Phedenskog) [16:09:09] godog: the whole grafana? [16:09:14] !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1043.eqiad.wmnet with reason: Bootstrapping — T389423 [16:09:18] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [16:09:50] !log dancy@deploy2002 Installing scap version "4.142.0" for 193 host(s) [16:10:20] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [16:10:22] claime: the whole apache to be exact [16:10:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:10:27] rookie mistake [16:10:31] damn [16:10:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657721 (10phaultfinder) [16:11:12] ikr? 
[16:11:25] FIRING: SystemdUnitFailed: imposm.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:39] (03CR) 10Effie Mouzeli: [C:03+1] mw-on-k8s: align alerts with "pools" of capacity [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: 10Scott French) [16:13:41] !log bootstrapping restbase1034-a/cassandra — T389423 [16:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:21] !log dancy@deploy2002 Installation of scap version "4.142.0" completed for 193 hosts [16:14:39] FIRING: [6x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:04] (03PS1) 10Brouberol: Use abspaths when sub-processing dumps commands [dumps] - 10https://gerrit.wikimedia.org/r/1129880 (https://phabricator.wikimedia.org/T388378) [16:18:56] !log `ALTER TABLE public.wikidata_relation_members ALTER COLUMN id TYPE bigint;` on maps1009's postgres - T389462 [16:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:00] T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462 [16:20:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:21:54] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:22:21] (03CR) 10Btullis: [C:03+1] "Thanks" [dumps] - 10https://gerrit.wikimedia.org/r/1129880 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:26:25] FIRING: [5x] 
SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:55] (03Merged) 10jenkins-bot: thumbor: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129824 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [16:27:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:30:12] (03PS1) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 [16:31:25] FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:32:46] (03PS1) 10BCornwall: cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1129884 [16:34:34] (03CR) 10Vgutierrez: [C:03+1] cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1129884 (owner: 10BCornwall) [16:36:36] (03CR) 10BCornwall: [C:03+2] cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1129884 (owner: 10BCornwall) [16:36:41] (03CR) 10BCornwall: [V:03+2 C:03+2] cdn: Restart varnishmtail on Varnish upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1129884 (owner: 10BCornwall) [16:37:43] (03Abandoned) 10BCornwall: sre.cdn.roll-upgrade-haproxy: migrate to SRELBBatchRunnerBaseCDN [cookbooks] - 10https://gerrit.wikimedia.org/r/925681 (owner: 10Jbond) [16:40:31] (03CR) 10Brouberol: [V:03+2 C:03+2] Use abspaths when sub-processing dumps commands [dumps] - 
10https://gerrit.wikimedia.org/r/1129880 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:41:03] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3067.esams.wmnet} and A:cp [16:41:09] !log brouberol@deploy2002 Started scap build-images: (no justification provided) [16:41:39] !log brouberol@deploy2002 Finished scap build-images: (no justification provided) (duration: 00m 30s) [16:41:54] (03PS1) 10Elukey: maps: fix id type for the table wikidata_relation_members in imposm_mapping [puppet] - 10https://gerrit.wikimedia.org/r/1129886 (https://phabricator.wikimedia.org/T389462) [16:41:58] !log Upgrading varnish to 7.1 on cp3067 (T378737) [16:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:02] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [16:42:02] since I forgot a --reason :/ [16:42:11] (03PS1) 10Jgiannelos: imposm: Change mapping to use bigint for column `id` [puppet] - 10https://gerrit.wikimedia.org/r/1129888 (https://phabricator.wikimedia.org/T389462) [16:42:26] (03CR) 10Jgiannelos: [C:03+1] maps: fix id type for the table wikidata_relation_members in imposm_mapping [puppet] - 10https://gerrit.wikimedia.org/r/1129886 (https://phabricator.wikimedia.org/T389462) (owner: 10Elukey) [16:42:57] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:43:43] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:44:30] (03PS1) 10DCausse: cirrus: update alerts based on rc0 topics [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) [16:45:12] (03Abandoned) 10Jgiannelos: imposm: Change mapping to use bigint for column `id` [puppet] - 10https://gerrit.wikimedia.org/r/1129888 (https://phabricator.wikimedia.org/T389462) (owner: 10Jgiannelos) [16:45:35] (03CR) 10Elukey: [C:03+2] maps: fix id type for the table wikidata_relation_members 
in imposm_mapping [puppet] - 10https://gerrit.wikimedia.org/r/1129886 (https://phabricator.wikimedia.org/T389462) (owner: 10Elukey) [16:45:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657869 (10phaultfinder) [16:45:39] (03CR) 10DCausse: "needs to be merged right after I05b8375" [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [16:45:48] (03CR) 10DCausse: [C:04-2] cirrus: update alerts based on rc0 topics [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [16:46:28] !log imported haproxykafka 0.3.6 into apt repository (added TimestampType) (T388397) [16:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:32] T388397: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397 [16:48:23] !log upgrade haproxykafka to 0.3.6 on A:cp (gradual rollout) [16:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:29] joal ^^ [16:49:01] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3067.esams.wmnet} and A:cp [16:52:22] (03PS1) 10Effie Mouzeli: Revert "thumbor: use monitoring.named_ports" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129890 [16:52:47] !log brouberol@deploy2002 Started scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix - T388378 [16:52:53] T388378: Orchestrate dumps v1 from an airflow instance - https://phabricator.wikimedia.org/T388378 [16:53:12] !log brouberol@deploy2002 Finished scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix - T388378 (duration: 00m 24s) [16:53:26] (03PS1) 10BCornwall: sre.cdn.roll-upgrade-varnish: Fix package parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/1129891 [16:53:44] FIRING: [2x] 
KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:54:42] (03CR) 10CI reject: [V:04-1] cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [16:54:50] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:55:18] this is under maintenance --^ but it should be silenced [16:55:52] (03CR) 10Effie Mouzeli: [C:03+2] Revert "thumbor: use monitoring.named_ports" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129890 (owner: 10Effie Mouzeli) [16:56:02] ah no right I wasn't able via cookbook since the host was wiped by reimage [16:56:07] lemme try to add something manually [16:57:43] (03Merged) 10jenkins-bot: Revert "thumbor: use monitoring.named_ports" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129890 (owner: 10Effie Mouzeli) [16:58:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3068.esams.wmnet} and A:cp [16:59:51] !log brouberol@deploy2002 Started scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix (w/o cache) - T388378 [16:59:55] T388378: Orchestrate dumps v1 from an airflow instance - https://phabricator.wikimedia.org/T388378 [17:00:05] bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700). 
nyaa~ [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700) [17:00:29] (03CR) 10BCornwall: [C:03+2] cdn: Add roll-upgrade-varnish (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [17:00:47] !log brouberol@deploy2002 Finished scap build-images: Rebuild mediawiki-cli with recent dumps abspath fix (w/o cache) - T388378 (duration: 00m 56s) [17:01:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [17:02:30] (03CR) 10BCornwall: [C:03+2] upgrade cp3068 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129855 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:04:59] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [17:05:21] (03CR) 10Bking: [C:03+2] "self-merging, as this does not affect production hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1129877 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [17:05:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10657999 (10phaultfinder) [17:06:02] (03PS1) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) [17:06:25] RESOLVED: SystemdUnitFailed: imposm.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:07:52] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1129128 (https://phabricator.wikimedia.org/T389072) (owner: 10Filippo Giunchedi) [17:08:25] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10658021 (10MatthewVernon) That was (suspiciously) easy to re-add, but I notice there's no `megacli` available on this system,... [17:08:55] (03PS2) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 [17:13:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:17:11] (03PS3) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 [17:18:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:19:54] (03PS1) 10Reedy: Sanitizer::normalizeWhitespace: simplify redundant preg_replace [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129895 (https://phabricator.wikimedia.org/T388733) [17:22:07] jouncebot: nowandnext [17:22:07] For the next 0 hour(s) and 37 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700) [17:22:07] For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1700) [17:22:07] In 0 hour(s) and 37 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800) [17:22:24] (03CR) 10Andrew Bogott: [C:03+2] typos: add 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1129868 (owner: 10Andrew Bogott) [17:22:33] (03CR) 10Reedy: [C:03+2] Sanitizer::normalizeWhitespace: simplify redundant preg_replace [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129895 (https://phabricator.wikimedia.org/T388733) (owner: 10Reedy) [17:23:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host prior to reimage - bking@cumin2002 - T380752 [17:23:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host prior to reimage - bking@cumin2002 - T380752 [17:23:05] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [17:26:37] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3068.esams.wmnet} and A:cp [17:26:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658127 (10phaultfinder) [17:27:02] (03CR) 10CI reject: [V:04-1] cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [17:28:25] FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:49] (03CR) 10BCornwall: [C:04-1] "The code might be good but I think we could give some more background/meaning behind the commit in the message." [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:30:02] (03CR) 10BCornwall: [C:04-1] "Marking unresolved." 
[puppet] - 10https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:33:25] FIRING: [4x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:42] (03PS19) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [17:36:37] (03Merged) 10jenkins-bot: Sanitizer::normalizeWhitespace: simplify redundant preg_replace [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129895 (https://phabricator.wikimedia.org/T388733) (owner: 10Reedy) [17:37:57] (03CR) 10Ssingh: [C:03+1] "I am curious, what was broken here? +1 if it works but still curious." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129891 (owner: 10BCornwall) [17:38:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10658230 (10VRiley-WMF) @MatthewVernon Thanks for the heads up. This disk has been replaced using one of those spares! Still awaiting on Dell to send out the replacement. However, ma... [17:38:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10658231 (10VRiley-WMF) 05Open→03Resolved [17:38:25] RESOLVED: [4x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:01] (03CR) 10Ssingh: [C:03+1] "Ahhhh ok nvm, I see it now." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1129891 (owner: 10BCornwall) [17:39:56] (03CR) 10Scott French: "Thank you both for the review!" [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: 10Scott French) [17:40:17] (03CR) 10Scott French: [C:03+2] mw-on-k8s: align alerts with "pools" of capacity [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: 10Scott French) [17:40:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:41:44] (03Merged) 10jenkins-bot: mw-on-k8s: align alerts with "pools" of capacity [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: 10Scott French) [17:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658330 (10phaultfinder) [17:44:41] (03CR) 10Dreamy Jazz: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [17:45:25] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:11] (03CR) 10Ssingh: "I think this is a good idea and much cleaner. The only question I have is if you know why we had the specific version field for ATS. 
I can" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [17:48:28] FIRING: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:48:29] (03PS1) 10Cwhite: add statsv throughput alerts [alerts] - 10https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) [17:49:26] (03PS3) 10Gergő Tisza: varnish: Fix X-Wikimedia-Debug cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) [17:50:20] (03CR) 10BCornwall: "Honestly, I don't recall at all, and I don't see it being useful for the purposes of these cookbooks since their goal is to roll out new v" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [17:50:25] FIRING: [7x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:38] (03CR) 10Gergő Tisza: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:51:22] !log sudo cumin 'A:cp-text' 'disable-puppet "rolling out CR 1129349"': T350094 [17:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:26] T350094: Enable verbose logging without installing the WikimediaDebug extension - https://phabricator.wikimedia.org/T350094 [17:52:56] (03CR) 10Ssingh: [C:03+2] varnish: Fix X-Wikimedia-Debug cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:53:43] !log reedy@deploy2002 Synchronized php-1.44.0-wmf.21/includes/parser/Sanitizer.php: 
T388733 (duration: 11m 36s) [17:53:47] T388733: PHP Warning: MediaWiki\Parser\Sanitizer::normalizeWhitespace: Failed to normalize whitespace: 6 [Called from MediaWiki\Parser\Sanitizer::normalizeWhitespace in /srv/mediawiki/php-1.44.0-wmf.20/includes/parser/Sanitizer.php - https://phabricator.wikimedia.org/T388733 [17:54:15] (03PS4) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 [17:54:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [17:55:25] FIRING: [6x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:23] (03CR) 10Ssingh: "@rcoccioli@wikimedia.org: any thoughts on the unification part? Brett's current approach would make it cleaner to have one cookbook but he" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [17:57:35] !log enable puppet and run agent on cp3071 to test CR 1129349 [17:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] jnuche and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800). 
nyaa~ [18:00:25] FIRING: [5x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:02:18] !log sudo cumin -b11 'A:cp-text' 'enable-puppet-agent "rolling out CR 1129349"': T350094 [18:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:22] T350094: Enable verbose logging without installing the WikimediaDebug extension - https://phabricator.wikimedia.org/T350094 [18:02:48] !log sudo cumin -b11 'A:cp-text' 'run-puppet-agent "rolling out CR 1129349"': T350094 [18:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:28] RESOLVED: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:08:04] (03CR) 10BCornwall: "Specifically, the complexity of a single cookbook would increase quite a bit, starting with having to pair the service with associated pac" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [18:08:20] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658621 (10phaultfinder) [18:08:56] (03PS1) 10Gergő Tisza: Enable SUL3 logins for 50% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) [18:09:00] (03CR) 10BCornwall: "For posterity, the join() was creating one long comma-separated string that was then passed to apt, e.g. 
`apt-get install foo,bar,baz`)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129891 (owner: 10BCornwall) [18:09:04] (03CR) 10BCornwall: [C:03+2] sre.cdn.roll-upgrade-varnish: Fix package parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/1129891 (owner: 10BCornwall) [18:09:53] (03CR) 10Ssingh: "Yeah I am fine with merging this, unless volans can suggest a clean way of handling it. (He usually has surprises)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [18:10:25] RESOLVED: [5x] SystemdUnitFailed: elasticsearch_7@relforge-eqiad-small-alpha.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [18:10:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3069.esams.wmnet} and A:cp [18:12:01] (03CR) 10BCornwall: [C:03+2] upgrade cp3069 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129856 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:12:21] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [18:12:21] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [18:14:35] (03CR) 10Giuseppe Lavagetto: [C:03+1] hieradata: migrate mw-misc to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:14:49] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:14:58] 06SRE, 10SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529 (10Cpetrillo) 03NEW [18:15:55] FIRING: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:11] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3069.esams.wmnet} and A:cp [18:20:55] FIRING: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:22:35] 06SRE, 10SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658717 (10Milimetric) approved as I am authorized to do per [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/d... [18:22:45] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10658718 (10BCornwall) Were they able to get back to you, @RobH ? [18:24:12] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10658719 (10BCornwall) Hi, @RobH, has this been able to be looked at? It's been depooled for a while now. Thanks! 
[18:24:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658743 (10phaultfinder) [18:25:55] RESOLVED: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:36] 06SRE, 10SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658759 (10ssingh) [18:31:12] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10658770 (10RobH) Working on it now, pulling reports from idrac for case. 
[18:31:55] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:32:10] RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:35] jouncebot: nowandnext [18:36:35] For the next 1 hour(s) and 23 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800) [18:36:36] In 1 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000) [18:36:55] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:23] jeena: I see that the train rolled to group2 in the earlier window. would you have any objections if I were to deploy some changes during this window? (one last PHP 8.1 switch) [18:38:44] swfrench-wmf: yes that would be fine afaik [18:38:58] jeena: great, thank you! [18:39:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:40:04] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:40:08] (03CR) 10Scott French: [C:03+2] hieradata: migrate mw-misc to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:41:21] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10658820 (10RobH) Yes, and they saw no temp errors in their investigation of the logs. I'll flag this and dump their updates to this task later this week. [18:42:04] (03CR) 10Scott French: [C:03+2] mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:43:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) (owner: 10Cwhite) [18:43:46] (03Merged) 10jenkins-bot: mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:45:47] 06SRE, 10SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658825 (10ssingh) [18:46:27] 06SRE, 10SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658838 (10ssingh) @lanebecker: this requires your approval, thanks. 
(Thanks @Milimetric) [18:46:55] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:36] !log swfrench@deploy2002 Started scap sync-world: Switch mw-misc to PHP 8.1 - T383845 [18:48:40] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:49:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:51:05] !log swfrench@deploy2002 Finished scap sync-world: Switch mw-misc to PHP 8.1 - T383845 (duration: 03m 22s) [18:51:14] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host to test reimage - bking@cumin2002 - T380752 [18:51:14] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host to test reimage - bking@cumin2002 - T380752 [18:51:19] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [18:51:22] 06SRE, 10SRE-Access-Requests: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658849 (10lanebecker) Dropping in from holiday mode to approve. Approved! 
[18:51:55] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:52:38] (03CR) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [18:53:32] (03PS1) 10Ssingh: admin: add cpetrillo to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1129913 (https://phabricator.wikimedia.org/T389529) [18:54:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658881 (10ssingh) [18:54:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658884 (10phaultfinder) [18:55:41] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10658893 (10RobH) [18:56:55] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:57:31] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [18:57:31] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [18:59:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658895 (10phaultfinder) [19:00:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:01:19] (03CR) 10RLazarus: [C:03+1] admin: add cpetrillo to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1129913 (https://phabricator.wikimedia.org/T389529) (owner: 10Ssingh) [19:01:52] (03PS1) 10Jforrester: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) [19:02:29] (03CR) 10BCornwall: [C:03+2] upgrade cp3070 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129857 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:02:46] FYI, barring any surprises with mw-misc, this concludes my changes [19:03:28] (03CR) 10Ssingh: [C:03+2] admin: add cpetrillo to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1129913 (https://phabricator.wikimedia.org/T389529) (owner: 10Ssingh) [19:03:45] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3070.esams.wmnet} and A:cp [19:04:39] FIRING: [5x] ProbeDown: Service restbase1043-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:07:21] RESOLVED: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [19:07:21] !log dancy@deploy2002 Installing scap version "4.143.0" for 193 host(s) [19:07:27] FIRING: [6x] ProbeDown: Service prometheus1005:443 has failed probes 
(http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:08:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [19:09:39] FIRING: [8x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:10:18] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3070.esams.wmnet} and A:cp [19:11:50] !log dancy@deploy2002 Installation of scap version "4.143.0" completed for 193 hosts [19:11:52] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to SSH access, analytics-privatedata-users, for CPetrillo-WMF - https://phabricator.wikimedia.org/T389529#10658944 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@krb1001:~$ sudo manage_principals.py create cpetrillo --email_addres... 
[19:13:21] !log dancy@deploy2002 Started scap sync-world: T388761 [19:13:22] (03PS1) 10Bking: relforge: add relforge1004 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) [19:13:25] T388761: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761 [19:13:41] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389537 (10phaultfinder) 03NEW [19:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10658968 (10phaultfinder) [19:15:08] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [19:15:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:16:47] !log restarting prometheus@ops.service in prometheus1005 [19:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:19] thanks denisse! 
[19:17:21] (03CR) 10Bking: [C:03+2] relforge: add relforge1004 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [19:17:33] (03CR) 10Bking: [C:03+2] "self-merging, as this does not touch production" [puppet] - 10https://gerrit.wikimedia.org/r/1129915 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [19:18:20] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [19:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [19:21:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for ban host to test reimage - bking@cumin2002 - T380752 [19:21:48] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for ban host to test reimage - bking@cumin2002 - T380752 [19:21:51] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [19:22:27] FIRING: [8x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:37] !log dancy@deploy2002 Finished scap sync-world: T388761 (duration: 11m 15s) [19:24:41] T388761: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761 [19:25:25] RESOLVED: [4x] SystemdUnitFailed: 
prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:33] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [19:26:34] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [19:27:33] (03CR) 10Dzahn: [C:03+1] "lgtm, if it's in an order like on https://phabricator.wikimedia.org/T326368 and the "profile::gerrit::active_host" is also changed in Hier" [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:27:34] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban host to test puppet code - bking@cumin2002 - T380752 [19:27:35] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban host to test puppet code - bking@cumin2002 - T380752 [19:27:37] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [19:28:40] (03PS2) 10Gergő Tisza: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) [19:29:30] (03CR) 10CI reject: [V:04-1] Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza) [19:31:55] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:18] (03CR) 10Dzahn: gerrit: switchover to gerrit2002 (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:33:28] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [19:33:28] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [19:33:43] (03CR) 10Dzahn: [C:03+1] "generally looks fine to me, it's just about the order of things. so.. first disable gerrit on source.. then sync lfs data one last time.. " [puppet] - 10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:35:31] (03PS3) 10Gergő Tisza: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) [19:38:20] jouncebot: nowandnext [19:38:20] For the next 0 hour(s) and 21 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T1800) [19:38:20] In 0 hour(s) and 21 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000) [19:42:10] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:36] (03CR) 10BCornwall: [C:03+2] upgrade cp3071 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129858 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:43:22] 10SRE-swift-storage: Swift file replicated to codfw but not eqiad - https://phabricator.wikimedia.org/T389539 (10Dylsss) 03NEW [19:44:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3071.esams.wmnet} and A:cp [19:46:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for 
Mediawiki mw-web releases routed via main at eqiad: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:46:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:47:57] (03PS1) 10Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 [19:48:21] (03CR) 10CI reject: [V:04-1] gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (owner: 10Dzahn) [19:48:23] (03CR) 10Bartosz Dziewoński: [C:03+1] Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza) [19:50:01] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1116 [19:50:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:37] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3071.esams.wmnet} and A:cp [19:50:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659058 (10phaultfinder) [19:50:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002" [19:51:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added 
mgmt for elastic - jclark@cumin1002" [19:51:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:51:09] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1117 [19:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:51:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host elastic1117 [19:51:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1116 [19:51:29] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1117 [19:52:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1117 [19:52:59] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1118 [19:54:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1118 [19:54:09] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1119 [19:55:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1119 [19:55:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1120 [19:55:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1120 [19:56:57] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1121 [19:57:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza) [19:58:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1121 [19:58:30] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1122 [19:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1122 [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000) [20:00:05] inflatador, cwhite, Superpes, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:16] .o/ [20:00:24] o/ [20:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [20:02:05] o/ [20:02:40] (03PS1) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) [20:02:47] (03PS2) 10Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) [20:03:12] (03CR) 10CI reject: [V:04-1] gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:03:35] (03CR) 10Dzahn: "hrmm,, something is wrong with the syntax but "$first_element = lookup('my_array')[0]" is supposed to be it. Anyways.. 
for now just presen" [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:03:40] (03CR) 10Dzahn: [C:04-1] gerrit: replace hardcoded host name and codfw string for replica [puppet] - 10https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:04:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10659093 (10Jclark-ctr) [20:05:21] (03CR) 10Dzahn: "I would like to avoid that we always have to replace hardcoded host names in multiple places every time we switch.. 
this is just a first i" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [20:07:53] I can deploy [20:08:28] (03CR) 10Dzahn: [V:03+1 C:03+2] "query tested (71 rows in 0.128 seconds)" [puppet] - 10https://gerrit.wikimedia.org/r/1129806 (https://phabricator.wikimedia.org/T380300) (owner: 10Aklapper) [20:09:55] (03PS1) 10Krinkle: docroot: Enable Chrome credential sharing on foundation.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520) [20:10:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [20:11:55] (03CR) 10Gergő Tisza: [C:03+1] docroot: Enable Chrome credential sharing on foundation.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [20:13:06] (03Merged) 10jenkins-bot: cirrus: explicitly route search traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [20:13:25] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129181|cirrus: explicitly route search traffic to eqiad (T388610)]] [20:13:29] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [20:18:22] !log tgr@deploy2002 dcausse, tgr: Backport for [[gerrit:1129181|cirrus: explicitly route search traffic to eqiad (T388610)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:20:16] (03CR) 10BCornwall: [C:03+2] upgrade cp3072 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129859 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:21:29] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on 
P{cp3072.esams.wmnet} and A:cp [20:21:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659176 (10phaultfinder) [20:25:07] inflatador: do you want to test it? [20:26:15] tgr_ 1 sec [20:26:39] tgr_ no, we're good [20:27:02] !log tgr@deploy2002 dcausse, tgr: Continuing with sync [20:27:26] (03PS2) 10Jforrester: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) [20:27:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3072.esams.wmnet} and A:cp [20:27:45] (03CR) 10Jforrester: "Re-cherry-picked now that the patch has landed in master, so we get the nice blame git hash." [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester) [20:28:45] (03CR) 10Dzahn: "I see T381417 is now resolved. How about the status of this now?" [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:29:55] (03CR) 10Dzahn: create a namespace for codesearch on k8s-aux cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:31:13] (03PS1) 10Cwhite: es_exporter: constrain wikifunctions query [puppet] - 10https://gerrit.wikimedia.org/r/1129927 (https://phabricator.wikimedia.org/T388174) [20:32:40] (03CR) 10Dzahn: [C:03+1] "done and done. I should be able to deploy the new namespace at any time. 
Docs per Alex: https://wikitech.wikimedia.org/wiki/Kubernetes/Add" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:32:45] (03PS2) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) [20:34:32] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129181|cirrus: explicitly route search traffic to eqiad (T388610)]] (duration: 21m 07s) [20:34:36] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [20:35:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [20:37:53] (03PS3) 10Reedy: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester) [20:39:03] (03PS20) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [20:39:53] (03CR) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [20:41:07] (03Merged) 10jenkins-bot: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [20:41:25] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1081461|Profiler: emit both statsd and dogstatsd (T359385)]] [20:41:28] T359385: Migrate MediaWiki.arclamp to statslib - https://phabricator.wikimedia.org/T359385 [20:41:53] (03CR) 10Fabfur: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [20:45:35] jouncebot: nowandnext [20:45:35] For the next 0 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2000) [20:45:35] In 0 hour(s) and 14 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100) [20:45:53] !log tgr@deploy2002 cwhite, tgr: Backport for [[gerrit:1081461|Profiler: emit both statsd and dogstatsd (T359385)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:41] Reedy: are you planning to do deployments? I'd like to do the deployment server switchover after this deploy window, but I can wait if you need me to [20:47:11] kamila_: I wouldn't mind, it gets rid of quite a lot of logspam [20:47:49] Reedy: ok, go ahead and lmk when you're done please :-) [20:48:09] once tgr_ is done with the current deployment, I'd like to claim mwdebug1002 to debug an issue. [20:48:14] (03CR) 10Reedy: [C:03+2] AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester) [20:48:31] kamila_: if it's not urgent, I'd like to deploy the fix for T389433, and that will probably take a while (code is pretty much untestable outside production) [20:48:32] T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433 [20:49:31] cwhite: do you need to test?
[20:49:32] (03PS1) 10Andrew Bogott: Update horizon version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1129935 (https://phabricator.wikimedia.org/T380531) [20:49:57] tgr_: LGTM so far, no errors AFAICT [20:50:04] (03CR) 10Ecarg: [C:03+1] es_exporter: constrain wikifunctions query [puppet] - 10https://gerrit.wikimedia.org/r/1129927 (https://phabricator.wikimedia.org/T388174) (owner: 10Cwhite) [20:50:06] !log tgr@deploy2002 cwhite, tgr: Continuing with sync [20:50:36] tgr_: what is "a while"? it's not urgent, so you can go ahead, but I'd like to know roughly when I'll be able to start [20:51:03] (03CR) 10Andrew Bogott: [C:03+2] Update horizon version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1129935 (https://phabricator.wikimedia.org/T380531) (owner: 10Andrew Bogott) [20:51:14] (03Merged) 10jenkins-bot: AbstractIterator: Make PHP 8.1 compatible [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129914 (https://phabricator.wikimedia.org/T389515) (owner: 10Jforrester) [20:51:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.382s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:51:38] I have two config patches that don't need testing, if those can be batched with Reedy's patch, that's ~15 min [20:51:49] then the wikitech patch is another half an hour maybe? [20:52:00] ok, cool, thanks tgr_ [20:53:22] Reedy: does that sound ok? [20:53:36] wfm. 
Mine doesn't need testing [20:53:44] as it's causing cli logspam (dumps) [20:54:57] (03PS1) 10Cwhite: es_exporter: add metric gathering for wikifunctions backend services [puppet] - 10https://gerrit.wikimedia.org/r/1129936 (https://phabricator.wikimedia.org/T388174) [20:56:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.39s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:56:28] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10659322 (10Umherirrender) [20:57:36] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081461|Profiler: emit both statsd and dogstatsd (T359385)]] (duration: 16m 11s) [20:57:40] T359385: Migrate MediaWiki.arclamp to statslib - https://phabricator.wikimedia.org/T359385 [20:58:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15) [20:58:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [20:59:21] (03Merged) 10jenkins-bot: Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15) [20:59:23] (03Merged) 10jenkins-bot: Enable SUL3 
logins for 50% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129905 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [20:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659334 (10phaultfinder) [20:59:44] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129435|Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (T389400)]], [[gerrit:1129905|Enable SUL3 logins for 50% of group 1 users (T384153)]], [[gerrit:1129914|AbstractIterator: Make PHP 8.1 compatible (T389515)]] [20:59:51] T389400: Lift IP for a edit-a-thon in Ciudad de Buenos Aires, Argentina 2025-03-29 - https://phabricator.wikimedia.org/T389400 [20:59:51] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:59:51] T389515: PHP Deprecated: Return type of Flow\Search\Iterators\AbstractIterator::current() should either be compatible with Iterator::current(): mixed, or the #[\ReturnTypeWillChange] attribute should be used - https://phabricator.wikimedia.org/T389515 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100) [21:00:31] thanks tgr_ :) [21:01:24] (03CR) 10Herron: [C:03+1] "Thanks for the ping, LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [21:04:35] !log tgr@deploy2002 tgr, jforrester, superpes: Backport for [[gerrit:1129435|Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (T389400)]], [[gerrit:1129905|Enable SUL3 logins for 50% of group 1 users (T384153)]], [[gerrit:1129914|AbstractIterator: Make PHP 8.1 compatible (T389515)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:10] (03CR) 10Ecarg: [C:03+1] es_exporter: add metric gathering for wikifunctions backend services [puppet] - 
10https://gerrit.wikimedia.org/r/1129936 (https://phabricator.wikimedia.org/T388174) (owner: 10Cwhite) [21:06:36] !log tgr@deploy2002 tgr, jforrester, superpes: Continuing with sync [21:13:19] (03PS1) 10Ahmon Dancy: cloud.yaml: Supply a reasonable default for profile::tlsproxy::envoy::global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) [21:13:41] (03CR) 10CI reject: [V:04-1] cloud.yaml: Supply a reasonable default for profile::tlsproxy::envoy::global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [21:14:11] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129435|Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (T389400)]], [[gerrit:1129905|Enable SUL3 logins for 50% of group 1 users (T384153)]], [[gerrit:1129914|AbstractIterator: Make PHP 8.1 compatible (T389515)]] (duration: 14m 26s) [21:14:17] T389400: Lift IP for a edit-a-thon in Ciudad de Buenos Aires, Argentina 2025-03-29 - https://phabricator.wikimedia.org/T389400 [21:14:18] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [21:14:18] T389515: PHP Deprecated: Return type of Flow\Search\Iterators\AbstractIterator::current() should either be compatible with Iterator::current(): mixed, or the #[\ReturnTypeWillChange] attribute should be used - https://phabricator.wikimedia.org/T389515 [21:14:27] 10SRE-swift-storage, 06Commons, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T389538#10659391 (10Pppery) [21:14:42] (03CR) 10Cwhite: [C:03+2] es_exporter: constrain wikifunctions query [puppet] - 10https://gerrit.wikimedia.org/r/1129927 (https://phabricator.wikimedia.org/T388174) (owner: 10Cwhite) [21:14:45] (03PS2) 10Ahmon Dancy: cloud.yaml: Supply default for 
profile::tlsproxy::envoy::global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) [21:14:53] (03CR) 10Cwhite: [C:03+2] es_exporter: add metric gathering for wikifunctions backend services [puppet] - 10https://gerrit.wikimedia.org/r/1129936 (https://phabricator.wikimedia.org/T388174) (owner: 10Cwhite) [21:14:54] Thanks tgr_ [21:14:57] :) [21:14:57] (03PS1) 10Ahmon Dancy: profile::tlsproxy::envoy: Tweak an error message [puppet] - 10https://gerrit.wikimedia.org/r/1129940 [21:15:15] deep breath [21:15:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza) [21:16:10] (03Merged) 10jenkins-bot: Clear stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129845 (https://phabricator.wikimedia.org/T389433) (owner: 10Gergő Tisza) [21:16:27] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1129845|Clear stuck session cookies on Wikitech (T389433)]] [21:16:31] T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433 [21:19:09] !log tgr@deploy2002 tgr: Backport for [[gerrit:1129845|Clear stuck session cookies on Wikitech (T389433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:33] !log tgr@deploy2002 tgr: Continuing with sync [21:31:40] (03CR) 10BCornwall: [C:03+2] upgrade cp3073 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129860 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:31:55] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3073.esams.wmnet} and A:cp [21:32:27] FIRING: [5x] ProbeDown: Service restbase1043-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:32:50] (03CR) 10Ebernhardson: [C:03+1] cirrus: update alerts based on rc0 topics [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [21:33:39] (03CR) 10Ebernhardson: [C:03+1] cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [21:33:44] !log bootstrapping restbase1034-b/cassandra — T389423 [21:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:48] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [21:34:19] (03PS1) 10Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) [21:34:21] !log bootstrapping restbase1043-b/cassandra — T389423 (previous msg(s) typo-ed) [21:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:02] (03PS2) 10Ahmon Dancy: SpiderPig: Require explicit hiera config to enable Spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) [21:36:58] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129845|Clear stuck session cookies on Wikitech (T389433)]] (duration: 20m 31s) [21:37:02] T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433 [21:37:27] FIRING: [5x] ProbeDown: Service restbase1043-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:38:01] (03CR) 10Dzahn: "thanks for the fix! 
Just the part that git blame tells me the line is like this since 2020 confuses me right now. Because the puppet erro" [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [21:38:03] (03PS1) 10Andrew Bogott: Update horizon version in codfw1dev, again [puppet] - 10https://gerrit.wikimedia.org/r/1129944 (https://phabricator.wikimedia.org/T380531) [21:38:12] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3073.esams.wmnet} and A:cp [21:38:29] (03CR) 10Andrew Bogott: [C:03+2] Update horizon version in codfw1dev, again [puppet] - 10https://gerrit.wikimedia.org/r/1129944 (https://phabricator.wikimedia.org/T380531) (owner: 10Andrew Bogott) [21:39:01] kamila_: all done, sorry for the wait! [21:39:10] !log late UTC deploys done [21:39:10] np, thanks tgr_ ! [21:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:21] (03CR) 10Ahmon Dancy: "Yeah, it's the "include profile::tlsproxy::envoy" at https://gerrit.wikimedia.org/g/operations/puppet/+/refs/changes/31/1094531/25/modules" [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [21:39:32] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [21:40:43] (03PS1) 10Jasmine: wmnet: update deployment CNAME record to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1129945 (https://phabricator.wikimedia.org/T385155) [21:41:28] (03CR) 10Dzahn: "puppet breakage on non-prod-deployment servers -> https://phabricator.wikimedia.org/T383946#10658168 - thanks for the fix at https://gerri" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [21:41:38] (03CR) 10Dzahn: [C:03+2] "Gotcha! thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [21:42:57] (03CR) 10Dzahn: [C:03+2] "I really hope it's not going to affect other cloud VPS machines using envoy that aren't deployment servers.. this global cloud.yaml is bro" [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [21:43:18] jouncebot: nowandnext [21:43:18] For the next 0 hour(s) and 16 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100) [21:43:19] In 8 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0600) [21:44:26] Note that jasmine_ and I are switching the deployment server as part of the datacenter switchover process. The current deployment server will be deploy1003.eqiad.wmnet . We do not expect this to cause any issues, but ping me or jasmine_ if you think you found one! Thanks :-) [21:47:59] (03CR) 10Dzahn: "You may already be aware, but please keep in mind there are a couple other places in Hiera where "the deployment server" is defined:" [dns] - 10https://gerrit.wikimedia.org/r/1129945 (https://phabricator.wikimedia.org/T385155) (owner: 10Jasmine) [21:49:45] !log kamila@deploy2002 Locking from deployment [MediaWiki]: deployment server switch -- T385155 [21:49:49] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [21:52:16] (03CR) 10Ahmon Dancy: [C:04-1] "Holding. Not working as expected in beta yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [21:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659704 (10phaultfinder) [21:54:59] (03CR) 10Kamila Součková: [C:03+2] wmnet: point deploy server at eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [21:56:26] (03Abandoned) 10Jasmine: wmnet: update deployment CNAME record to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1129945 (https://phabricator.wikimedia.org/T385155) (owner: 10Jasmine) [21:56:44] (03CR) 10Dzahn: "probably safer to pass the parameter through to the systemd::service defines so that services are stopped if you ever go backwards from pr" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [21:56:44] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10659728 (10Tgr) >>! In T389543#10659214, @Tgr wrote: >... 
[21:58:34] jouncebot: nowandnext [21:58:34] For the next 0 hour(s) and 1 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250320T2100) [21:58:34] In 8 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0600) [21:59:20] (03PS2) 10Kamila Součková: wmnet: point deploy server at eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [21:59:32] (03CR) 10Dzahn: [C:03+2] "this fixed the error but there is a new one after that:" [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [21:59:42] (03CR) 10Ahmon Dancy: [C:04-1] "That happens on lines 38 and 43 of modules/profile/manifests/scap/spiderpig.pp. Or do you mean something else?" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:00:00] (03CR) 10Kamila Součková: [C:03+2] wmnet: point deploy server at eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [22:00:26] (03CR) 10Dzahn: [C:03+2] "Hmm.. this seems like it would affect all cloud VPS machines using envoy now.. tempted to revert" [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [22:00:58] (03CR) 10Ahmon Dancy: "Go ahead and revert. Let's see if we can figure out something better tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1129938 (https://phabricator.wikimedia.org/T383946) (owner: 10Ahmon Dancy) [22:00:58] !log kamila@dns1004 START - running authdns-update [22:01:22] kamila_: are you planning to change common.yaml and common/scap.yaml after the DNS change, not before? [22:01:31] mutante: I've lost my steam for the day. Can we regroup tomorrow? 
[22:01:47] dancy: sounds good, yes [22:02:05] I guess revert is slightly better than not revert [22:02:15] Agreed. [22:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:02:40] (03PS1) 10Dzahn: Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - 10https://gerrit.wikimedia.org/r/1129949 [22:02:41] mutante: after, I am sitting on a scap lock [22:02:55] kamila_: gotcha!:) [22:03:02] (03CR) 10CI reject: [V:04-1] Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - 10https://gerrit.wikimedia.org/r/1129949 (owner: 10Dzahn) [22:03:30] gotta love -1 on reverts [22:03:50] noms [22:03:52] ah, long lines.. 
[22:04:03] (03PS2) 10Dzahn: Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - 10https://gerrit.wikimedia.org/r/1129949 [22:05:07] (03CR) 10Dzahn: [C:03+2] Revert "cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name" [puppet] - 10https://gerrit.wikimedia.org/r/1129949 (owner: 10Dzahn) [22:07:02] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [22:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:08:12] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659784 (10RobH) Support won't push the case further until we update all firmware, doing so now. 
[22:08:49] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp4047.ulsfo.wmnet [22:09:04] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [22:09:24] !log dzahn@dns1004 START - running authdns-update [22:09:29] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp4047.ulsfo.wmnet [22:11:24] (03CR) 10BCornwall: [C:03+2] upgrade cp3075 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129861 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:12:46] !log kamila@dns1004 START - running authdns-update [22:13:08] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3075.esams.wmnet} and A:cp [22:18:48] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3075.esams.wmnet} and A:cp [22:22:12] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4046.ulsfo.wmnet [22:23:10] (03PS1) 10Kamila Součková: hieradata: update deployment_server to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1129952 (https://phabricator.wikimedia.org/T385155) [22:26:16] (03CR) 10Jasmine: [C:03+1] hieradata: update deployment_server to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1129952 (https://phabricator.wikimedia.org/T385155) (owner: 10Kamila Součková) [22:26:25] 06SRE, 06Traffic, 13Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825#10659837 (10Aklapper) @BBlack, @Vgutierrez: Could you please answer the last comment? Thanks in advance! 
[22:38:22] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns1005.wikimedia.org [22:38:55] !log depool dns1005 to debug zone files not in sync with dns.git [22:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:04] !log kamila@dns1004 START - running authdns-update [22:39:08] (03CR) 10Pppery: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [22:41:10] !log kamila@dns1004 END - running authdns-update [22:41:11] (03CR) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [22:41:19] !log switch deployment.w.o DNS to eqiad [22:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:26] (03CR) 10Kamila Součková: [C:03+2] hieradata: update deployment_server to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1129952 (https://phabricator.wikimedia.org/T385155) (owner: 10Kamila Součková) [22:44:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10659911 (10phaultfinder) [22:47:11] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [22:48:11] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet [22:49:19] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [22:51:42] (03CR) 10BCornwall: [C:03+2] upgrade cp3076 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129862 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:51:58] (03PS2) 10Tim Starling: block: Don't modify an autoblock when the user 
specifies an IP [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129957 (https://phabricator.wikimedia.org/T389452) [22:56:09] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3076.esams.wmnet} and A:cp [22:56:42] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659939 (10RobH) Support is requiring firmware updates, as this is pretty far out of date. Current iDrac firmware: 5.10.30.00 Current BIOS firmware: 1.6.5 Support stated we should go from 5.10.30... [22:58:16] !log kamila@deploy2002 Unlocked for deployment [MediaWiki]: deployment server switch -- T385155 (duration: 68m 30s) [22:58:20] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [23:01:39] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3076.esams.wmnet} and A:cp [23:02:19] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659951 (10BCornwall) Thanks for doing this. If you want any assistance on doing the updates, let me know - I'd do it right now but it looks like you might be in the middle of upgrades and I don't wanna... 
[23:04:14] (03PS1) 10Kimberly Sarabia: Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) [23:05:49] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet [23:06:45] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [23:06:45] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10659982 (10RobH) Yeah, it took the same command 3 times for it to finally not time out or break in some way, but it finally updated to cp4047 (IDRAC): now at version: 5.10.50.0 . Now to move it along u... [23:10:20] (03CR) 10Tim Starling: [C:03+2] block: Don't modify an autoblock when the user specifies an IP [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129957 (https://phabricator.wikimedia.org/T389452) (owner: 10Tim Starling) [23:10:53] !log kamila@deploy1003 Started scap sync-world: Test deployment to validate deployment server switchover - T385155 [23:10:57] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [23:13:47] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns1005.wikimedia.org [23:14:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660004 (10phaultfinder) [23:15:05] (03Merged) 10jenkins-bot: block: Don't modify an autoblock when the user specifies an IP [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129957 (https://phabricator.wikimedia.org/T389452) (owner: 10Tim Starling) [23:15:43] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet [23:18:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - 
https://phabricator.wikimedia.org/T381274#10660007 (10Jhancock.wm) 05Open→03Resolved [23:18:10] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10660011 (10RobH) > Support, > > Can you confirm you see the failure and what part the failure occurred on with the logs sent over? > > Updating the firmware now. > > Please advise, > > Hi Rob >... [23:18:20] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [23:18:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10660012 (10Jhancock.wm) @MoritzMuehlenhoff ready! [23:18:33] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [23:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:19:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660013 (10phaultfinder) [23:25:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660014 (10phaultfinder) [23:28:29] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet [23:29:44] !log updating cp4047 bios via T387238, server will flap but is not pooled [23:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:49] T387238: cp4047 flapped 
(host went down) - https://phabricator.wikimedia.org/T387238 [23:30:20] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [23:30:36] !log kamila@deploy1003 Finished scap sync-world: Test deployment to validate deployment server switchover - T385155 (duration: 19m 42s) [23:30:39] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [23:30:43] (03CR) 10Dzahn: "oh.. duh! yea, please ignore that previous comment :)" [puppet] - 10https://gerrit.wikimedia.org/r/1129943 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [23:30:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660022 (10phaultfinder) [23:30:49] (03CR) 10BCornwall: [C:03+2] upgrade cp3077 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129863 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:31:21] TimStarling: you can deploy if you want [23:31:36] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet [23:32:02] (03CR) 10Pppery: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [23:32:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3077.esams.wmnet} and A:cp [23:32:27] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:33] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3077.esams.wmnet} and A:cp [23:38:24] !log tstarling@deploy1003 Started scap sync-world: 
Backport for [[gerrit:1129957|block: Don't modify an autoblock when the user specifies an IP (T389452)]] [23:42:13] !log brett@dns1005 START - running authdns-update [23:43:43] !log brett@dns1005 END - running authdns-update [23:45:47] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp4047.ulsfo.wmnet [23:45:55] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet [23:46:15] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [23:47:28] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet [23:53:46] !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1129957|block: Don't modify an autoblock when the user specifies an IP (T389452)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:53:57] !log tstarling@deploy1003 tstarling: Continuing with sync [23:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660040 (10phaultfinder) [23:58:10] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp4047.ulsfo.wmnet [23:58:13] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet [23:58:21] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [23:59:26] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet