[00:05:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666278 (10phaultfinder) [00:38:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1130337 [00:38:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1130337 (owner: 10TrainBranchBot) [00:49:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1130337 (owner: 10TrainBranchBot) [00:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666281 (10phaultfinder) [01:04:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666285 (10phaultfinder) [01:08:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1130338 [01:08:52] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1130338 (owner: 10TrainBranchBot) [01:26:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1130338 (owner: 10TrainBranchBot) [01:50:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666299 (10phaultfinder) [02:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666344 (10phaultfinder) [02:34:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666361 (10phaultfinder) [03:09:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666377 (10phaultfinder) [03:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an 
abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [03:18:45] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [03:20:56] Original exception: [b5d67861-41ee-4605-a8fb-076ca1d19533] 2025-03-24 03:20:34: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError" [03:21:18] have gotten a few of these trying to open arbitrary pages [03:22:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 2.451% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:22:28] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:24:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:24:39] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:27:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at 
eqiad: 21.35% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:29:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:35:30] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:40:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1194 - https://phabricator.wikimedia.org/T389751 (10ops-monitoring-bot) 03NEW [03:59:07] Original exception: [f342ce9e-8d63-4deb-b51e-8c10d15e7639] 2025-03-24 03:58:53: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError" [03:59:15] getting lots of these again [03:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666387 (10phaultfinder) [04:00:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 4.634% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:02:15] FIRING: [9x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:05:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 5.102% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:07:15] RESOLVED: [9x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666391 (10phaultfinder) [04:15:02] !incidents [04:15:03] 5778 (UNACKED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [04:15:10] !ack 5778 [04:15:10] 5778 (ACKED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [04:42:52] (03PS1) 10C. Scott Ananian: Turn on Parsoid fragment support everywhere (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130343 (https://phabricator.wikimedia.org/T374661) [04:43:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130343 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [05:00:31] (03PS1) 10Abijeet Patro: AX: Disable automatic translation entrypoints before release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130345 (https://phabricator.wikimedia.org/T389176) [05:00:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666417 (10phaultfinder) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666425 (10phaultfinder) [05:15:47] (03PS1) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 [05:16:47] (03PS2) 10Abijeet Patro: AX: Disable automatic translation entrypoints before release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130345 (https://phabricator.wikimedia.org/T389176) [05:16:51] (03PS2) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) [05:17:34] (03PS3) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) [05:19:41] (03PS4) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) [05:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666431 (10phaultfinder) [05:30:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666433 (10phaultfinder) [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666451 (10phaultfinder) [06:15:19] (03PS1) 10Kevin Bazira: changeprop: add liftwing revertrisk-language-agnostic stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130349 (https://phabricator.wikimedia.org/T326179) [06:18:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T389376 [06:18:06] T389376: 
Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T389376 [06:18:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T389376', diff saved to https://phabricator.wikimedia.org/P74304 and previous config saved to /var/cache/conftool/dbconfig/20250324-061812-marostegui.json [06:19:07] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1129298 (https://phabricator.wikimedia.org/T389376) (owner: 10Gerrit maintenance bot) [06:22:02] !log Starting s2 codfw failover from db2204 to db2207 - T389376 [06:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2207 to s2 primary T389376', diff saved to https://phabricator.wikimedia.org/P74305 and previous config saved to /var/cache/conftool/dbconfig/20250324-062223-marostegui.json [06:22:25] (03CR) 10Kevin Bazira: "For more context, this follows a similar implementation as the article-country stream in: I1b396ed95c4919f767795a354d4c943864127088" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130349 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [06:23:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2204 T389376', diff saved to https://phabricator.wikimedia.org/P74306 and previous config saved to /var/cache/conftool/dbconfig/20250324-062338-marostegui.json [06:23:42] T389376: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T389376 [06:25:31] (03PS1) 10Marostegui: db2204: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130462 [06:25:58] (03CR) 10Marostegui: [C:03+2] db2204: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130462 (owner: 10Marostegui) [06:26:16] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2204.codfw.wmnet [06:30:50] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.mysql.upgrade (exit_code=0) for db2204.codfw.wmnet [06:32:09] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Index rebuild [06:50:10] (03PS1) 10Marostegui: Revert "db2204: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1130463 [06:51:02] (03CR) 10Marostegui: [C:03+2] Revert "db2204: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1130463 (owner: 10Marostegui) [06:52:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2229 with weight 0 T389382', diff saved to https://phabricator.wikimedia.org/P74307 and previous config saved to /var/cache/conftool/dbconfig/20250324-065236-marostegui.json [06:52:41] T389382: Switchover s6 master (db2214 -> db2229) - https://phabricator.wikimedia.org/T389382 [06:53:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s6 T389382 [06:54:51] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2229 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1129309 (https://phabricator.wikimedia.org/T389382) (owner: 10Gerrit maintenance bot) [06:59:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666494 (10phaultfinder) [07:01:17] !log Starting s6 codfw failover from db2214 to db2229 - T389382 [07:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:21] T389382: Switchover s6 master (db2214 -> db2229) - https://phabricator.wikimedia.org/T389382 [07:01:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2229 to s6 primary T389382', diff saved to https://phabricator.wikimedia.org/P74308 and previous config saved to /var/cache/conftool/dbconfig/20250324-070147-marostegui.json [07:02:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2214 T389382', diff saved to https://phabricator.wikimedia.org/P74309 and previous config 
saved to /var/cache/conftool/dbconfig/20250324-070245-marostegui.json [07:04:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2214.codfw.wmnet [07:08:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2214.codfw.wmnet [07:10:07] (03PS1) 10Marostegui: db2214: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130466 (https://phabricator.wikimedia.org/T389754) [07:10:37] (03CR) 10Marostegui: [C:03+2] db2214: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130466 (https://phabricator.wikimedia.org/T389754) (owner: 10Marostegui) [07:16:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2169.codfw.wmnet onto db2214.codfw.wmnet [07:16:51] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db2214 - Depool db2169.codfw.wmnet to then clone it to db2214.codfw.wmnet - marostegui@cumin1002 [07:16:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2214 - Depool db2169.codfw.wmnet to then clone it to db2214.codfw.wmnet - marostegui@cumin1002 [07:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [07:18:45] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [07:24:14] (03PS4) 10Cyndywikime: Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) [07:24:14] (03CR) 10Cyndywikime: "This change is now ready for 
review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) (owner: 10Cyndywikime) [07:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666546 (10phaultfinder) [07:28:36] !log rebalance ganeti eqiad/D following reimages T382507 [07:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:40] T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 [07:35:30] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:36:45] (03CR) 10Muehlenhoff: [C:03+2] maps/bookworm: Cleanup confusing Hiera settings for postgresql replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130100 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:42:19] (03PS1) 10Filippo Giunchedi: idp: add session timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/1130521 (https://phabricator.wikimedia.org/T389629) [07:42:20] (03PS1) 10Filippo Giunchedi: hieradata: bump grafana cas session timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1130522 (https://phabricator.wikimedia.org/T389629) [07:49:17] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5136/c" [puppet] - 10https://gerrit.wikimedia.org/r/1130521 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [07:50:14] (03CR) 10Slyngshede: [C:03+2] Handle empty query on block user page [software/bitu] - 10https://gerrit.wikimedia.org/r/1128399 (https://phabricator.wikimedia.org/T385947) (owner: 10Slyngshede) [07:50:25] (03CR) 10Muehlenhoff: "It's a good point wrt churn, 
changing max_wal_senders requires a postgresql restart, so this would cause issues when adding/removing a rep" [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:50:28] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5140/co" [puppet] - 10https://gerrit.wikimedia.org/r/1130522 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [07:51:09] (03CR) 10Filippo Giunchedi: [V:03+1] "Prerequisite for Icee2d60040 to bump grafana timeouts as per task" [puppet] - 10https://gerrit.wikimedia.org/r/1130521 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [07:52:58] (03Merged) 10jenkins-bot: Handle empty query on block user page [software/bitu] - 10https://gerrit.wikimedia.org/r/1128399 (https://phabricator.wikimedia.org/T385947) (owner: 10Slyngshede) [07:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666575 (10phaultfinder) [07:55:14] (03PS4) 10Muehlenhoff: osm: Handle new requirements for Postgres replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) [07:56:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T0800). [08:00:05] tgr: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:29] o/ [08:00:35] I can deploy [08:04:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666585 (10phaultfinder) [08:07:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130328 (owner: 10Gergő Tisza) [08:09:12] (03CR) 10Muehlenhoff: [C:03+2] maps: Cleanup confusing Hiera settings for postgresql replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130099 (owner: 10Muehlenhoff) [08:11:13] (03Merged) 10jenkins-bot: Preserve 'useformat' param when accessing Special:ChangePassword [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130328 (owner: 10Gergő Tisza) [08:11:32] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1130328|Preserve 'useformat' param when accessing Special:ChangePassword]] [08:12:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74311 and previous config saved to /var/cache/conftool/dbconfig/20250324-081258-root.json [08:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666592 (10phaultfinder) [08:15:53] (03PS2) 10Federico Ceratto: clone.py, clone_test.py: Check if the target host is known to dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1127071 (https://phabricator.wikimedia.org/T387023) [08:15:53] (03CR) 10Federico Ceratto: "Ready for review. 
Tested with dry-run (it executes the check anyways)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1127071 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [08:17:09] (03PS7) 10Krinkle: search-grafana-dashboards: format results as markdown, and add --json [software] - 10https://gerrit.wikimedia.org/r/1129242 (owner: 10Filippo Giunchedi) [08:19:33] (03PS8) 10Krinkle: search-grafana-dashboards: format results as markdown, and add --json [software] - 10https://gerrit.wikimedia.org/r/1129242 (owner: 10Filippo Giunchedi) [08:20:24] (03CR) 10Krinkle: "I re-ordered it to keep `search()` on top for easy reference and console usage." [software] - 10https://gerrit.wikimedia.org/r/1129242 (owner: 10Filippo Giunchedi) [08:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666600 (10phaultfinder) [08:26:03] (03CR) 10Elukey: [C:03+1] osm: Handle new requirements for Postgres replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:27:29] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bookworm [08:28:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74312 and previous config saved to /var/cache/conftool/dbconfig/20250324-082804-root.json [08:28:53] jouncebot: nowandnext [08:28:53] For the next 0 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T0800) [08:28:53] In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1000) [08:30:25] !log tgr@deploy1003 tgr: Backport for [[gerrit:1130328|Preserve 'useformat' param when accessing Special:ChangePassword]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [08:31:10] (03CR) 10Muehlenhoff: [C:03+2] osm: Handle new requirements for Postgres replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:32:27] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [08:34:09] (03Merged) 10jenkins-bot: cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [08:35:29] !log tgr@deploy1003 tgr: Continuing with sync [08:36:45] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:36:56] (03CR) 10Gergő Tisza: [C:03+2] authmanager: Use an URL parameter to keep track of returns [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130320 (https://phabricator.wikimedia.org/T388250) (owner: 10Gergő Tisza) [08:36:59] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:38:31] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:39:53] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:41:10] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:42:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:42:54] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:43:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: Repooling 
after rebuild index', diff saved to https://phabricator.wikimedia.org/P74313 and previous config saved to /var/cache/conftool/dbconfig/20250324-084309-root.json [08:44:33] jouncebot: now and next [08:44:33] For the next 0 hour(s) and 15 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T0800) [08:45:16] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130328|Preserve 'useformat' param when accessing Special:ChangePassword]] (duration: 33m 43s) [08:45:18] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [08:45:53] (03PS1) 10DCausse: Revert "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130525 [08:46:02] (03CR) 10DCausse: [C:03+2] Revert "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130525 (owner: 10DCausse) [08:47:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130320 (https://phabricator.wikimedia.org/T388250) (owner: 10Gergő Tisza) [08:47:47] (03Merged) 10jenkins-bot: Revert "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130525 (owner: 10DCausse) [08:47:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [08:47:55] (03CR) 10Filippo Giunchedi: [C:03+2] logstash: move filter_truncate before indexing/output [puppet] - 10https://gerrit.wikimedia.org/r/1129128 (https://phabricator.wikimedia.org/T389072) (owner: 10Filippo Giunchedi) [08:48:14] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:48:15] (03Merged) 10jenkins-bot: authmanager: Use an URL parameter to keep track of 
returns [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130320 (https://phabricator.wikimedia.org/T388250) (owner: 10Gergő Tisza) [08:48:23] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:48:31] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1130320|authmanager: Use an URL parameter to keep track of returns (T388250)]] [08:48:35] T388250: LogicException: CentralAuthReturnRequest not found - https://phabricator.wikimedia.org/T388250 [08:49:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666658 (10phaultfinder) [08:52:35] tgr_: would you mind letting me know once you are done with the window ? [08:52:56] ack [08:53:01] thank you [08:53:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:54:02] !log tgr@deploy1003 tgr: Backport for [[gerrit:1130320|authmanager: Use an URL parameter to keep track of returns (T388250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:54:06] T388250: LogicException: CentralAuthReturnRequest not found - https://phabricator.wikimedia.org/T388250 [08:54:24] (03CR) 10Michael Große: [C:03+1] "A timeout of 24 hours feels reasonable to me. 
Thank you for taking care of this 🙏" [puppet] - 10https://gerrit.wikimedia.org/r/1130522 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [08:55:26] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:55:37] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:56:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.391s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:56:33] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:57:29] !log tgr@deploy1003 tgr: Continuing with sync [08:58:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74314 and previous config saved to /var/cache/conftool/dbconfig/20250324-085815-root.json [08:58:45] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [09:01:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.391s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:01:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:01:47] (03PS2) 10Filippo Giunchedi: add statsv throughput alerts [alerts] - 10https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) (owner: 10Cwhite) [09:02:07] (03CR) 10Elukey: [C:03+1] "I didn't forget about this one, I'll try to set up some "safe" way to deploy it with Traffic and Service Ops :)" [puppet] - 10https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [09:02:35] (03CR) 10Filippo Giunchedi: [C:03+1] "I've boldly adjusted the alert and dashboard to use per-second rate()s, LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) (owner: 10Cwhite) [09:03:13] (03PS1) 10DCausse: Revert^2 "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130527 [09:03:21] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:03:51] (03PS2) 10DCausse: Revert^2 "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130527 [09:04:49] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130320|authmanager: Use an URL parameter to keep track of returns (T388250)]] (duration: 16m 18s) [09:04:53] T388250: LogicException: CentralAuthReturnRequest not found - https://phabricator.wikimedia.org/T388250 [09:05:01] godog: done [09:05:10] !log morning UTC deploys done [09:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:05:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2004.codfw.wmnet with OS bookworm [09:05:35] tgr_: neat, thank you [09:07:33] I see a "Wikimedia\Rdbms\LoadMonitor::computeServerState: host db2145 is not up?" error in the logs, don't know if that's normal DB error log noise or something else [09:09:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.016s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:13:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74315 and previous config saved to /var/cache/conftool/dbconfig/20250324-091320-root.json [09:13:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:14:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.403s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:14:58] (03CR) 10Volans: "question inline on the behavior" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [09:15:08] !log restarting purged on A:cp due to T389707 [09:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:12] T389707: purged event lag keeps piling 
up in codfw topics after switchover - https://phabricator.wikimedia.org/T389707 [09:17:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10666709 (10elukey) @Jclark-ctr elastic1119 should be ready for reimage now (I've set UEFI in provisioning, hope it was t... [09:21:15] (03CR) 10Peter Fischer: [C:03+2] Revert^2 "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130527 (owner: 10DCausse) [09:23:09] (03Merged) 10jenkins-bot: Revert^2 "cirrus-streaming-updater: produce to v1 update streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130527 (owner: 10DCausse) [09:25:38] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:25:50] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:27:59] (03CR) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:29:34] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:29:50] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:32:58] dcausse: FYI for unrelated reasons though I noticed a bunch of logstash warnings like this https://phabricator.wikimedia.org/P74316 [09:33:37] godog: oh thanks for the heads up, looking [09:34:23] dcausse: ack no problem, not sure if related to your change [09:34:37] might be liftwing crafting improper json, looking if it's new or not [09:34:49] (03PS5) 10Cyndywikime: Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) [09:35:45] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666804 (10phaultfinder) [09:37:06] (03CR) 10Ayounsi: sre.network.cf: log if no changes were made (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [09:37:38] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [09:38:21] (03CR) 10Volans: "Couple of post-merge nits." [cookbooks] - 10https://gerrit.wikimedia.org/r/1130107 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [09:38:35] (03CR) 10Fabfur: [C:03+1] "agree with elukey about trying it on a "safe" location first" [puppet] - 10https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [09:39:30] (03PS2) 10DCausse: cirrus: update alerts based on rc0 topics [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) [09:39:39] (03CR) 10DCausse: [C:03+2] cirrus: update alerts based on rc0 topics [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [09:41:04] (03CR) 10Volans: [C:03+2] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:41:28] (03Merged) 10jenkins-bot: cirrus: update alerts based on rc0 topics [alerts] - 10https://gerrit.wikimedia.org/r/1129889 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [09:42:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.273s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:45:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666851 (10phaultfinder) [09:46:52] (03Merged) 10jenkins-bot: netbox: refactor support for GraphQL queries [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.273s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:47:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [09:51:35] jouncebot: nowandnext [09:51:35] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [09:51:35] In 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1000) [09:51:43] (03PS1) 10Cyndywikime: [Growth] enwiki: Release Add Link to 20% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130532 (https://phabricator.wikimedia.org/T388289) [09:52:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [09:52:39] RESOLVED: [2x] 
CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [09:52:49] (03PS1) 10Ladsgroup: Bump thumbnail steps to 35% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130533 (https://phabricator.wikimedia.org/T360589) [09:52:49] !incidents [09:52:50] 5778 (RESOLVED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [09:52:53] godog: I suspect it's liftwing printing some debug message in stderr which is then causing logstash to complain, filing a task [09:53:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130533 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [09:53:45] !incidents [09:53:46] 5778 (RESOLVED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [09:54:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1130521 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [09:54:29] (03Merged) 10jenkins-bot: Bump thumbnail steps to 35% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130533 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [09:54:46] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1130533|Bump thumbnail steps to 35% (T360589)]] [09:54:50] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [09:54:58] tappof: as far as I can tell this is the old pag.e, id 5770 which is open since Mar 21 [09:55:01] (03PS1) 10Btullis: Temporarily disable gobblin on an-launchger1002 [puppet] - 10https://gerrit.wikimedia.org/r/1130535 (https://phabricator.wikimedia.org/T376800) [09:55:38] (03PS2) 10Btullis: Temporarily disable 
gobblin on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1130535 (https://phabricator.wikimedia.org/T376800) [09:56:23] marostegui: hi, around? I want to start deploying patches for mainstash [09:56:57] (03CR) 10Cyndywikime: "This change is now ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130532 (https://phabricator.wikimedia.org/T388289) (owner: 10Cyndywikime) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1000) [10:00:14] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1130533|Bump thumbnail steps to 35% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:00:14] ok jelto thank you [10:00:18] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:01:50] (03PS1) 10Elukey: role::ml_k8s::worker: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130536 (https://phabricator.wikimedia.org/T387854) [10:02:26] (03CR) 10Brouberol: [C:03+1] Temporarily disable gobblin on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1130535 (https://phabricator.wikimedia.org/T376800) (owner: 10Btullis) [10:02:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [10:02:38] (03CR) 10Btullis: [C:03+2] Temporarily disable gobblin on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1130535 (https://phabricator.wikimedia.org/T376800) (owner: 10Btullis) [10:03:23] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:05:15] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 5 DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5141/" [puppet] - 10https://gerrit.wikimedia.org/r/1130536 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [10:05:35] (03PS6) 10Cyndywikime: Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) [10:05:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10666943 (10phaultfinder) [10:10:42] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130533|Bump thumbnail steps to 35% (T360589)]] (duration: 15m 56s) [10:10:46] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:10:53] I have a scary patch to deploy [10:11:07] (03PS16) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [10:12:40] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:13:15] !log shutdown all SG.IX peers - T386987 [10:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125556 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:14:46] (03Merged) 10jenkins-bot: Migrate x2 off LB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125556 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:14:58] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]] [10:15:04] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 
[10:15:04] T387654: Re-evaluate whether WMF's MainStash config should use LBFactory - https://phabricator.wikimedia.org/T387654 [10:15:31] mergeMessageFileList.php generated PHP notices/warnings: [10:15:31] Notice: Undefined variable: wmgMainStashServers in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 679 [10:15:31] Warning: Invalid argument supplied for foreach() in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 679 [10:15:35] that was fast [10:17:20] (03PS1) 10Muehlenhoff: Move maps-test2001/2002 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1130539 (https://phabricator.wikimedia.org/T381565) [10:17:21] (03PS1) 10Muehlenhoff: Switch maps-test2001 to master_bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130540 (https://phabricator.wikimedia.org/T381565) [10:17:23] (03PS1) 10Muehlenhoff: Move maps-test2002 to replica_bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130541 (https://phabricator.wikimedia.org/T381565) [10:17:53] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:18:50] (03CR) 10Elukey: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1130539 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:19:00] (03CR) 10Elukey: [C:03+1] Switch maps-test2001 to master_bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130540 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:19:07] (03PS1) 10Ladsgroup: etcd: Make Mainstash config globa variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130542 (https://phabricator.wikimedia.org/T383327) [10:19:24] (03CR) 10Elukey: [C:03+1] Switch maps-test2001 to master_bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130540 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:19:52] (03PS2) 10Ladsgroup: etcd: Make Mainstash config global variable [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1130542 (https://phabricator.wikimedia.org/T383327) [10:19:56] (03CR) 10Ladsgroup: [C:03+2] etcd: Make Mainstash config global variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130542 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:20:00] (03CR) 10Elukey: [C:03+1] Move maps-test2002 to replica_bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130541 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:20:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130542 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:20:47] (03Merged) 10jenkins-bot: etcd: Make Mainstash config global variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130542 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:21:02] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] [10:21:07] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 [10:21:07] T387654: Re-evaluate whether WMF's MainStash config should use LBFactory - https://phabricator.wikimedia.org/T387654 [10:21:15] (03PS17) 10Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [10:21:27] (03CR) 10Elukey: "Ready for a first pass!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [10:23:03] Amir1: Do you need me for those? 
I am busy with switchovers with federico3 [10:23:10] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::worker: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130536 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [10:23:31] for this patch no. it's noop [10:24:07] but for enabling data redundancy, It would be nice [10:24:08] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2169.codfw.wmnet onto db2214.codfw.wmnet [10:25:22] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] idp: add session timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/1130521 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [10:26:01] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] hieradata: bump grafana cas session timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1130522 (https://phabricator.wikimedia.org/T389629) (owner: 10Filippo Giunchedi) [10:26:01] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:26:35] Amir1: remember that only ms3 is set up [10:26:51] yeah [10:26:56] ms1 too [10:27:09] but we indeed need ms2 too [10:27:28] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:27:30] (if we enable data redundancy with these two sections, it'd be fine) [10:27:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s1 T389373 [10:27:52] T389373: Switchover s1 master (db2212 -> db2203) - 
https://phabricator.wikimedia.org/T389373 [10:28:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74318 and previous config saved to /var/cache/conftool/dbconfig/20250324-102811-root.json [10:28:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Index rebuild [10:29:37] (03CR) 10Muehlenhoff: [C:03+2] Move maps-test2001/2002 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1130539 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:29:39] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2203 with weight 0 T389373', diff saved to https://phabricator.wikimedia.org/P74319 and previous config saved to /var/cache/conftool/dbconfig/20250324-102944-fceratto.json [10:30:40] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS bookworm [10:31:06] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2005 [10:31:19] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:31:48] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [10:35:38] !log installing docker.io security updates [10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667071 (10phaultfinder) [10:38:42] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], 
[[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] (duration: 17m 39s) [10:38:48] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 [10:38:48] T387654: Re-evaluate whether WMF's MainStash config should use LBFactory - https://phabricator.wikimedia.org/T387654 [10:38:54] (03CR) 10Volans: "Interesting problem! :) I see various options, depending on which UX you prefer." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [10:41:41] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2005 - elukey@cumin1002" [10:41:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2005 - elukey@cumin1002" [10:41:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:41:48] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2005.codfw.wmnet 202.0.192.10.in-addr.arpa 2.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:41:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2005.codfw.wmnet 202.0.192.10.in-addr.arpa 2.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:41:52] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2005 [10:42:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2005 [10:42:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2005 [10:42:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [10:42:58] 06SRE, 07SRE-Unowned, 
10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10667098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm [10:43:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P74321 and previous config saved to /var/cache/conftool/dbconfig/20250324-104316-root.json [10:43:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: Primary switchover s1 T389373 [10:43:32] T389373: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T389373 [10:43:38] (03CR) 10Sergio Gimeno: [C:03+1] [Growth] enwiki: Release Add Link to 20% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130532 (https://phabricator.wikimedia.org/T388289) (owner: 10Cyndywikime) [10:45:36] (03CR) 10Klausman: [C:03+1] changeprop: add liftwing revertrisk-language-agnostic stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130349 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [10:50:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2003.codfw.wmnet [10:53:35] (03PS1) 10Ladsgroup: beta: Fix mainstash config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130545 (https://phabricator.wikimedia.org/T387654) [10:54:26] (03CR) 10Ladsgroup: [C:03+2] beta: Fix mainstash config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130545 (https://phabricator.wikimedia.org/T387654) (owner: 10Ladsgroup) [10:55:19] (03Merged) 10jenkins-bot: beta: Fix mainstash config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130545 (https://phabricator.wikimedia.org/T387654) (owner: 10Ladsgroup) [10:55:59] (03CR) 10Marostegui: [C:03+1] mariadb: Promote db2203 to s1 master [puppet] - 
10https://gerrit.wikimedia.org/r/1129293 (https://phabricator.wikimedia.org/T389373) (owner: 10Gerrit maintenance bot) [10:56:20] (03CR) 10Federico Ceratto: [C:03+1] mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1129293 (https://phabricator.wikimedia.org/T389373) (owner: 10Gerrit maintenance bot) [10:56:32] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1129293 (https://phabricator.wikimedia.org/T389373) (owner: 10Gerrit maintenance bot) [10:57:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2003.codfw.wmnet [10:58:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74322 and previous config saved to /var/cache/conftool/dbconfig/20250324-105822-root.json [10:59:12] !log Starting s1 codfw failover from db2212 to db2203 - T389373 [10:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:16] T389373: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T389373 [10:59:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10667188 (10MoritzMuehlenhoff) [10:59:45] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2005.codfw.wmnet with reason: host reimage [11:01:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [11:03:18] (03PS1) 10Marostegui: Revert "db2214: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1130549 [11:03:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2005.codfw.wmnet with reason: host reimage [11:03:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T389373', diff saved to 
https://phabricator.wikimedia.org/P74323 and previous config saved to /var/cache/conftool/dbconfig/20250324-110321-fceratto.json [11:03:26] (03PS1) 10Ladsgroup: beta: Fix mainstash, take II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130550 (https://phabricator.wikimedia.org/T387654) [11:03:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2004.codfw.wmnet [11:04:41] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-mariadb1001.eqiad.wmnet [11:06:00] (03CR) 10Marostegui: [C:03+2] Revert "db2214: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1130549 (owner: 10Marostegui) [11:06:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [11:09:20] (03PS2) 10Ladsgroup: beta: Fix mainstash, take II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130550 (https://phabricator.wikimedia.org/T387654) [11:10:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2004.codfw.wmnet [11:10:42] (03PS3) 10Ladsgroup: beta: Fix mainstash, take II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130550 (https://phabricator.wikimedia.org/T387654) [11:10:43] (03CR) 10CI reject: [V:04-1] beta: Fix mainstash, take II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130550 (https://phabricator.wikimedia.org/T387654) (owner: 10Ladsgroup) [11:11:05] 07SRE-Unowned, 10Maps: Build and import impism 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780 (10MoritzMuehlenhoff) 03NEW [11:11:24] 07SRE-Unowned, 10Maps: Build and import impism 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780#10667264 (10MoritzMuehlenhoff) [11:11:30] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10667265 (10MoritzMuehlenhoff) [11:11:32] !log btullis@cumin1002 END (PASS) 
- Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1001.eqiad.wmnet [11:11:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2212 T389373', diff saved to https://phabricator.wikimedia.org/P74324 and previous config saved to /var/cache/conftool/dbconfig/20250324-111157-fceratto.json [11:12:01] T389373: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T389373 [11:12:28] (03CR) 10Ladsgroup: [C:03+2] "it's not super nice but I see a lot of Realm checks in CS.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130550 (https://phabricator.wikimedia.org/T387654) (owner: 10Ladsgroup) [11:13:15] (03Merged) 10jenkins-bot: beta: Fix mainstash, take II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130550 (https://phabricator.wikimedia.org/T387654) (owner: 10Ladsgroup) [11:13:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74325 and previous config saved to /var/cache/conftool/dbconfig/20250324-111327-root.json [11:14:37] (03PS1) 10Btullis: Revert "Temporarily disable gobblin on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/1130556 [11:15:31] (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable gobblin on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/1130556 (owner: 10Btullis) [11:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [11:18:45] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:19:20] (03PS17) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [11:19:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS bookworm [11:21:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [11:21:47] !incidents [11:21:48] 5779 (UNACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [11:21:48] 5778 (RESOLVED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [11:21:53] !ack 5779 [11:21:54] 5779 (ACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [11:22:40] jelto: Should we renew the silence? [11:22:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2001.codfw.wmnet with OS bookworm [11:23:10] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10667330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm completed: - maps-test2001 (**PASS**)... 
[11:23:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2005.codfw.wmnet [11:23:56] I think the silence was for ml/inference services and not for "rate_limit_cluster" [11:24:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS bookworm [11:24:18] (03PS1) 10Ladsgroup: Enable dataRedundancy for mainstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130558 (https://phabricator.wikimedia.org/T383327) [11:24:20] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10667334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm [11:24:35] (03Abandoned) 10Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [11:24:44] it happened also on Friday, it seemed related to an ml-deployment for the api-gateway but it was reverted [11:24:46] which may be https://wikitech.wikimedia.org/wiki/Ratelimit but I'm not sure [11:25:05] (03CR) 10MVernon: [C:03+2] site/install: prep for new apus and thanos nodes [puppet] - 10https://gerrit.wikimedia.org/r/1130151 (https://phabricator.wikimedia.org/T389632) (owner: 10MVernon) [11:26:12] jelto, tappof - judging from https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57 it seems that the rate limit cluster is ending up in 504s [11:26:16] that is weird :D [11:26:34] so the ml-deployment is triggering rate limiting? Or are this two different issues?
In the dashboard I can see the inference issues last week and peaks for rate_limit_cluster today [11:26:34] it is as if the rate-limit backend itself was faulty [11:26:36] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2212.codfw.wmnet [11:26:37] jouncebot: nowandnext [11:26:37] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [11:26:38] In 1 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1300) [11:27:05] jelto: nono the deployment was related to a new api-gateway config, but it was reverted and the errors came up again [11:27:09] so likely unrelated [11:27:31] ok, I'll check the kubernetes deployment and then ask serviceops [11:28:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74326 and previous config saved to /var/cache/conftool/dbconfig/20250324-112833-root.json [11:28:45] !log installing busybox security updates [11:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:58] !log btullis@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [11:31:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2212.codfw.wmnet [11:31:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2005.codfw.wmnet [11:31:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [11:34:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
maps-test2006.codfw.wmnet [11:35:06] I'm not seeing anything particularly strange in the ratelimit deployment [11:35:30] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:02] !log btullis@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [11:38:10] I think the probe's a bit sensitive [11:38:23] 6% 500s and it triggers [11:40:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74327 and previous config saved to /var/cache/conftool/dbconfig/20250324-114019-root.json [11:41:18] (03PS1) 10Gergő Tisza: Do not throw an exception after shared-domain login with no token [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130560 (https://phabricator.wikimedia.org/T362715) [11:41:27] (03PS1) 10Gergő Tisza: Do not start central login from the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130561 (https://phabricator.wikimedia.org/T362715) [11:41:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130560 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza) [11:41:43] (03PS1) 10Muehlenhoff: Remove access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1130562 [11:41:48] there's nothing in the logs of the ratelimit service [11:42:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130561 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza) [11:42:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2006.codfw.wmnet [11:43:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [11:43:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74328 and previous config saved to /var/cache/conftool/dbconfig/20250324-114338-root.json [11:44:18] !log installing subversion security updates [11:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667445 (10phaultfinder) [11:45:02] (03CR) 10Ladsgroup: [C:03+2] Enable dataRedundancy for mainstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130558 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [11:45:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130558 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [11:45:30] FIRING: [8x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [11:46:14] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2188.codfw.wmnet onto db2212.codfw.wmnet [11:46:25] !log 
fceratto@cumin1002 START - Cookbook sre.mysql.depool db2212 - Depool db2188.codfw.wmnet to then clone it to db2212.codfw.wmnet - fceratto@cumin1002 [11:46:34] (03Merged) 10jenkins-bot: Enable dataRedundancy for mainstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130558 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [11:46:34] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2212 - Depool db2188.codfw.wmnet to then clone it to db2212.codfw.wmnet - fceratto@cumin1002 [11:46:45] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1130558|Enable dataRedundancy for mainstash (T383327)]] [11:46:49] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 [11:47:02] jelto: I'm not seeing anything particular except for the elevated 500s, nothing in logs, redis seems fine [11:47:32] claime: there is some discussion in -sre regarding traffic from WME [11:47:38] ah [11:51:26] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1130558|Enable dataRedundancy for mainstash (T383327)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:52:51] (03CR) 10Marostegui: [C:03+1] mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1129299 (https://phabricator.wikimedia.org/T389377) (owner: 10Gerrit maintenance bot) [11:54:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T389377 [11:54:46] T389377: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T389377 [11:54:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2209 with weight 0 T389377', diff saved to https://phabricator.wikimedia.org/P74330 and previous config saved to /var/cache/conftool/dbconfig/20250324-115457-fceratto.json [11:55:10] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:55:25] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74331 and previous config saved to /var/cache/conftool/dbconfig/20250324-115524-root.json [11:56:08] !log temporarily bump rate-limit replicas from 3 -> 6 on Wikikube eqiad [11:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74332 and previous config saved to /var/cache/conftool/dbconfig/20250324-115843-root.json [11:59:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667514 (10phaultfinder) [12:01:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130299 (owner: 10Bartosz Dziewoński) [12:01:49] (03CR) 10Bartosz Dziewoński: "(Scheduled for next week – after the MediaWiki change goes live)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130299 (owner: 10Bartosz Dziewoński) [12:02:28] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130558|Enable dataRedundancy for mainstash (T383327)]] (duration: 15m 43s) [12:02:33] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 [12:03:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2002.codfw.wmnet with OS bookworm [12:03:50] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10667520 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm completed: - maps-test2002 (**PASS**)... 
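[note] The db2169 repooling entries above step through 25%, 50%, 75% and 100%, roughly 15 minutes apart (11:13:28, 11:28:33, 11:43:39, 11:58:44). A minimal sketch of that kind of staged schedule follows, assuming a fixed 15-minute interval inferred only from those timestamps. `repool_schedule` is a hypothetical helper for illustration; it does not call dbctl or any real cookbook, and the actual tooling may pace the steps differently.

```python
from datetime import datetime, timedelta


def repool_schedule(start, steps=(25, 50, 75, 100), interval=timedelta(minutes=15)):
    """Return (time, percent) pairs for a staged repool.

    Hypothetical helper: the step percentages and the 15-minute interval
    are assumptions read off the log timestamps, not tool defaults.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


# Start time taken from the first db2169 repool entry in the log.
schedule = repool_schedule(datetime(2025, 3, 24, 11, 13, 28))
for when, pct in schedule:
    print(f"{when:%H:%M:%S} pool at {pct}%")
```

Ramping the pooled weight in steps lets a freshly restarted replica warm its caches before it takes a full share of traffic, which is presumably why the entries are spaced out rather than going straight to 100%.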
[12:05:51] (03CR) 10Federico Ceratto: [C:03+1] mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1129299 (https://phabricator.wikimedia.org/T389377) (owner: 10Gerrit maintenance bot) [12:05:54] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1129299 (https://phabricator.wikimedia.org/T389377) (owner: 10Gerrit maintenance bot) [12:08:01] (03PS1) 10Ladsgroup: Bump portal to head [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130570 [12:08:21] (03CR) 10Ladsgroup: [C:03+2] Bump portal to head [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130570 (owner: 10Ladsgroup) [12:08:24] !log Starting s3 codfw failover from db2205 to db2209 - T389377 [12:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:30] T389377: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T389377 [12:09:03] !log revert rate-limit replicas from 6 -> 3 on Wikikube eqiad [12:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:07] (03Merged) 10jenkins-bot: Bump portal to head [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130570 (owner: 10Ladsgroup) [12:09:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2209 to s3 primary T389377', diff saved to https://phabricator.wikimedia.org/P74333 and previous config saved to /var/cache/conftool/dbconfig/20250324-120947-fceratto.json [12:10:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74334 and previous config saved to /var/cache/conftool/dbconfig/20250324-121030-root.json [12:12:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2205 T389377', diff saved to https://phabricator.wikimedia.org/P74335 and previous config saved to /var/cache/conftool/dbconfig/20250324-121227-fceratto.json [12:16:28] (03PS1) 
10Ladsgroup: Remove x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130573 (https://phabricator.wikimedia.org/T383327) [12:17:28] (03PS1) 10Elukey: api-gateway: set the rate-limiter's timeout to ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 [12:18:09] (03PS2) 10Elukey: api-gateway: set the rate-limiter's timeout to ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 [12:18:57] (03CR) 10Elukey: "This is for sure not the issue, but after reading https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ratelimit/v3/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 (owner: 10Elukey) [12:19:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2205.codfw.wmnet [12:22:02] !log ladsgroup@deploy1003 Synchronized portals/wikipedia.org/assets: Minor wikimedia.org mobile fixes (T373204) (duration: 11m 37s) [12:22:08] T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204 [12:23:48] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2205.codfw.wmnet [12:24:52] !log ladsgroup@deploy1003 Synchronized portals: Minor wikimedia.org mobile fixes (T373204) (duration: 02m 48s) [12:25:01] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1130562 (owner: 10Muehlenhoff) [12:25:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74336 and previous config saved to /var/cache/conftool/dbconfig/20250324-122535-root.json [12:25:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667623 (10phaultfinder) [12:26:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) (owner: 10Ladsgroup) [12:26:22] (03CR) 10Clément 
Goubert: [C:03+1] api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 (owner: 10Hnowlan) [12:27:00] (03CR) 10Ladsgroup: [C:04-2] Switch the footer link to wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) (owner: 10Ladsgroup) [12:27:34] (03PS2) 10Ladsgroup: Switch the footer link to wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) [12:27:52] (03CR) 10Jelto: [C:03+1] "looks reasonable to me, the current `GatewayBackendErrorsHigh` alert is quite noisy" [alerts] - 10https://gerrit.wikimedia.org/r/1127564 (owner: 10Hnowlan) [12:27:54] (03CR) 10Ladsgroup: [C:03+1] "Added www. to avoid extra redirect being issued non-stop." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) (owner: 10Ladsgroup) [12:28:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) (owner: 10Ladsgroup) [12:29:09] (03Merged) 10jenkins-bot: Switch the footer link to wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) (owner: 10Ladsgroup) [12:29:21] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1126548|Switch the footer link to wikimedia.org (T387573 T373204)]] [12:29:26] T387573: Switch link of footer from wikimediafoundation.org to wikimedia.org - https://phabricator.wikimedia.org/T387573 [12:29:27] T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204 [12:31:16] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2227.codfw.wmnet onto db2205.codfw.wmnet [12:31:28] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - 
https://phabricator.wikimedia.org/T321786#10667658 (10Clement_Goubert) 05Open→03In progress [12:31:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2205 - Depool db2227.codfw.wmnet to then clone it to db2205.codfw.wmnet - fceratto@cumin1002 [12:31:55] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2205 - Depool db2227.codfw.wmnet to then clone it to db2205.codfw.wmnet - fceratto@cumin1002 [12:33:52] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1126548|Switch the footer link to wikimedia.org (T387573 T373204)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:34:28] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:34:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10667666 (10MatthewVernon) 05Open→03Resolved OK, I understand now, this system gets `perccli` rather than `megacli`, bu... [12:37:32] (03CR) 10Thiemo Kreuz (WMDE): "The code looks good. Except that the "labs" file is only for the beta cluster. Is this intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [12:38:41] (03PS2) 10Chuckonwumelu: Add Chuck key [labs/private] - 10https://gerrit.wikimedia.org/r/1129595 [12:39:38] (03CR) 10Chuckonwumelu: "Expanded on commit message" [labs/private] - 10https://gerrit.wikimedia.org/r/1129595 (owner: 10Chuckonwumelu) [12:40:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74337 and previous config saved to /var/cache/conftool/dbconfig/20250324-124041-root.json [12:41:09] (03CR) 10Hashar: "I will deploy that on Tuesday March 24 during the UTC morning backport window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127886 (https://phabricator.wikimedia.org/T297863) (owner: 10Hashar) [12:41:46] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1126548|Switch the footer link to wikimedia.org (T387573 T373204)]] (duration: 12m 25s) [12:41:51] T387573: Switch link of footer from wikimediafoundation.org to wikimedia.org - https://phabricator.wikimedia.org/T387573 [12:41:51] T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204 [12:43:06] (03CR) 10Jelto: [C:03+2] api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 (owner: 10Hnowlan) [12:44:20] (03Merged) 10jenkins-bot: api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 (owner: 10Hnowlan) [12:45:20] (03PS1) 10Kevin Bazira: ml-services: update rrla staging image and env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130579 (https://phabricator.wikimedia.org/T326179) [12:50:21] (03CR) 10Ladsgroup: [C:03+2] Remove x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130573 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [12:50:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130573 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [12:51:09] (03Merged) 10jenkins-bot: Remove x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130573 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [12:51:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130532 (https://phabricator.wikimedia.org/T388289) (owner: 10Cyndywikime) [12:51:24] !log ladsgroup@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1130573|Remove x2 (T383327)]] [12:51:28] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 [12:55:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667748 (10phaultfinder) [12:55:58] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1130573|Remove x2 (T383327)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:57:23] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:57:35] (03CR) 10Muehlenhoff: [C:03+2] Switch maps-test2001 to master_bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130540 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1300). [13:00:05] tgr, sfaci, cwhite, stephanebisson, Reedy, cscott, and Cyndywikime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] o/ [13:00:23] o/ [13:00:36] I'm around to deploy if necessary... [13:00:46] o/ [13:00:55] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2215 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1130590 (https://phabricator.wikimedia.org/T389795) [13:00:58] I can deploy, these days half the patches are mine anyway. [13:01:01] I'm around as well (mostly to test stephanebisson's patch :)) [13:01:15] I am around [13:01:23] we'll have to batch things because there's a lot of patches [13:01:54] who doesn't need to test / thinks it's unlikely the test will find anything wrong? [13:02:08] me [13:02:57] me. The patch is really easy. 
It's just about disabling something [13:03:29] But I can wait as well if necessary [13:03:32] cwhite: cscott: around for the deployment? [13:03:41] tgr_: Mine can just go out too. It's just in prep for the CN patch landing at some point in the future [13:03:53] so functionally a noop atm [13:04:31] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130573|Remove x2 (T383327)]] (duration: 13m 07s) [13:04:36] T383327: Re-architecture mainstash (x2) to allow easier maintenance - https://phabricator.wikimedia.org/T383327 [13:05:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [13:05:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130532 (https://phabricator.wikimedia.org/T388289) (owner: 10Cyndywikime) [13:05:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130128 (https://phabricator.wikimedia.org/T383801) (owner: 10Santiago Faci) [13:05:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130169 (https://phabricator.wikimedia.org/T387821) (owner: 10Sbisson) [13:05:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667803 (10phaultfinder) [13:06:09] (03Merged) 10jenkins-bot: CommonSettings: Migrate CentralNotice to Virtual Domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [13:06:14] (03Merged) 10jenkins-bot: [Growth] enwiki: Release Add Link to 20% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130532 (https://phabricator.wikimedia.org/T388289) (owner: 
10Cyndywikime) [13:06:16] (03Merged) 10jenkins-bot: [Experiment Platform] Disable test experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130128 (https://phabricator.wikimedia.org/T383801) (owner: 10Santiago Faci) [13:06:18] (03Merged) 10jenkins-bot: Enable Section Translation and Unified Dashboard on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130169 (https://phabricator.wikimedia.org/T387821) (owner: 10Sbisson) [13:06:33] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1129229|CommonSettings: Migrate CentralNotice to Virtual Domains (T389348)]], [[gerrit:1130532|[Growth] enwiki: Release Add Link to 20% of newcomers (T388289)]], [[gerrit:1130128|[Experiment Platform] Disable test experiment (T383801)]], [[gerrit:1130169|Enable Section Translation and Unified Dashboard on all wikipedias (T387821)]] [13:06:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2001.codfw.wmnet [13:06:41] T389348: Migrate CentralNotice to virtual domains - https://phabricator.wikimedia.org/T389348 [13:06:42] T388289: Add a link (Structured task): Increase rollout on English Wikipedia to 20% - https://phabricator.wikimedia.org/T388289 [13:06:42] T383801: Remove Experimentation Lab's first test experiment - https://phabricator.wikimedia.org/T383801 [13:06:42] T387821: Deploy unified dashboard on more wikis (phase 3) - https://phabricator.wikimedia.org/T387821 [13:11:20] !log tgr@deploy1003 reedy, sfaci, tgr, cyndywikime, sbisson: Backport for [[gerrit:1129229|CommonSettings: Migrate CentralNotice to Virtual Domains (T389348)]], [[gerrit:1130532|[Growth] enwiki: Release Add Link to 20% of newcomers (T388289)]], [[gerrit:1130128|[Experiment Platform] Disable test experiment (T383801)]], [[gerrit:1130169|Enable Section Translation and Unified Dashboard on all wikipedias (T387821)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:29] tgr_: I'm here, sorry a bit
late [13:11:39] (03CR) 10Volans: "Nice addition, we can surely simplify the workflow with this. I left some suggestions inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [13:12:09] (03PS10) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [13:12:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2001.codfw.wmnet [13:12:39] stephanebisson: do you want to test it? [13:12:50] tgr_ yes, testing it right now [13:12:55] (03PS1) 10MVernon: swift: add ms-be2089 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1130592 (https://phabricator.wikimedia.org/T388221) [13:13:24] stephanebisson: I tested in guwiki and looks good. Testing with Wiki with SX not enabled earlier as well. [13:14:02] tgr_: in theory mine could be batched: if we find issues it's not likely to be immediately, it will be some random but important page on a small wiki which regresses, which an editor will complain about in a couple of days. But I do have tests I can run. (Looks like I missed the batch anyway.) [13:14:29] Tested! The thing we wanted to disable is no longer running there. Thank you very much!! 
[13:14:31] (03CR) 10Muehlenhoff: [C:03+2] Move maps-test2002 to replica_bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130541 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667852 (10phaultfinder) [13:15:04] (03CR) 10Muehlenhoff: [C:03+2] Move maps-test2002 to replica_bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130541 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:15:34] stephanebisson: hiwiki also good (no SX earlier there) [13:15:51] kart_ LGTM [13:16:02] !log tgr@deploy1003 reedy, sfaci, tgr, cyndywikime, sbisson: Continuing with sync [13:16:20] :thumbs [13:17:03] (03CR) 10Jgleeson: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127890 (https://phabricator.wikimedia.org/T232912) (owner: 10Hashar) [13:17:06] (03CR) 10Volans: [C:03+2] spicerack: convert some @property into methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [13:17:12] cscott: could batch it with some of the backports. Doesn't it affect the parser cache key though? That makes the patch nontrivial by default IMO. [13:17:44] (03CR) 10Jgleeson: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127889 (owner: 10Hashar) [13:18:17] No, it just affects how parsoid parses certain pages with parser functions. But that does affect what gets stuck into the parser cache as output, so if you want to deploy it by itself that's fine too. 
[13:18:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:19:03] as long as it doesn't result in new cache misses I think that's fine, will deploy it together then [13:19:33] Wfm [13:19:43] (just being wary of the thing that happened last time with the huge CPU spike) [13:20:22] I'm trying to remember what that was. [13:21:53] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:mirrors add file age exporter [puppet] - 10https://gerrit.wikimedia.org/r/1129827 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:22:25] this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124874 [13:23:19] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129229|CommonSettings: Migrate CentralNotice to Virtual Domains (T389348)]], [[gerrit:1130532|[Growth] enwiki: Release Add Link to 20% of newcomers (T388289)]], [[gerrit:1130128|[Experiment Platform] Disable test experiment (T383801)]], [[gerrit:1130169|Enable Section Translation and Unified Dashboard on all wikipedias (T387821)]] (duration: 16m 45s) [13:23:26] T389348: Migrate CentralNotice to virtual domains - https://phabricator.wikimedia.org/T389348 [13:23:26] T388289: Add a link (Structured task): Increase rollout on English Wikipedia to 20% - https://phabricator.wikimedia.org/T388289 [13:23:27] T383801: Remove Experimentation Lab's first test experiment - https://phabricator.wikimedia.org/T383801 [13:23:27] T387821: Deploy unified dashboard on more wikis (phase 3) - https://phabricator.wikimedia.org/T387821 [13:23:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS 
[13:24:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:24:52] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130343 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian)
[13:24:52] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130560 (https://phabricator.wikimedia.org/T362715) (owner: Gergő Tisza)
[13:24:53] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130561 (https://phabricator.wikimedia.org/T362715) (owner: Gergő Tisza)
[13:25:10] tgr_: oh that was while I was on vacation. That must have been fun -- we usually assume mobile is pre warming all our parsoid caches for us but I guess mobile doesn't hit Wiktionary enough.
[13:25:25] (CR) Eevans: [C:+1] swift: add ms-be2089 to the rings [puppet] - https://gerrit.wikimedia.org/r/1130592 (https://phabricator.wikimedia.org/T388221) (owner: MVernon)
[13:25:45] (Merged) jenkins-bot: Turn on Parsoid fragment support everywhere (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1130343 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian)
[13:26:03] (PS1) Bartosz Dziewoński: Fix clearing stuck 'UserID' and 'UserName' cookies on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/1130593 (https://phabricator.wikimedia.org/T389796)
[13:26:15] (Merged) jenkins-bot: Do not throw an exception after shared-domain login with no token [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130560 (https://phabricator.wikimedia.org/T362715) (owner: Gergő Tisza)
[13:26:24] (CR) MVernon: [C:+2] swift: add ms-be2089 to the rings [puppet] - https://gerrit.wikimedia.org/r/1130592 (https://phabricator.wikimedia.org/T388221) (owner: MVernon)
[13:26:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:27:30] (PS1) Slyngshede: P:firewall remove unused code [puppet] - https://gerrit.wikimedia.org/r/1130594 (https://phabricator.wikimedia.org/T350694)
[13:27:37] (Merged) jenkins-bot: spicerack: convert some @property into methods [software/spicerack] - https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) (owner: Volans)
[13:29:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10667935 (phaultfinder)
[13:29:39] tgr_: I'm around for the config patch.
[13:30:38] (PS1) Bartosz Dziewoński: Restore deprecated aliases for CommentStoreComment and RawMessage [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130596 (https://phabricator.wikimedia.org/T388725)
[13:30:56] (PS1) Klausman: role::ml_k8s::worker: move ml-serv2011 to containerd [puppet] - https://gerrit.wikimedia.org/r/1130595 (https://phabricator.wikimedia.org/T387854)
[13:31:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:32:07] cscott: is wiktionary even served by mobile? (i suppose you mean mobile app). on android i get the impression that not, but maybe i'm mistaken
[13:32:22] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1015.eqiad.wmnet
[13:33:05] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:33:08] yeah, that might have been the reason! i don't know, i wasn't here to postmortem it. ;-p but there's a "prewarm cache" job in core which *should* have ensured a warm parsoid parsercache, my guess would be that its configuration has drifted or it was turned off, intentionally or accidentally.
[13:33:11] (Merged) jenkins-bot: Do not start central login from the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130561 (https://phabricator.wikimedia.org/T362715) (owner: Gergő Tisza)
[13:33:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Primary switchover x1 T389795
[13:33:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2215 with weight 0 T389795', diff saved to https://phabricator.wikimedia.org/P74338 and previous config saved to /var/cache/conftool/dbconfig/20250324-133320-marostegui.json
[13:33:24] T389795: Switchover x1 master (db2196 -> db2215) - https://phabricator.wikimedia.org/T389795
[13:33:27] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1130343|Turn on Parsoid fragment support everywhere (take 2) (T374661 T380758 T389545 T387608)]], [[gerrit:1130560|Do not throw an exception after shared-domain login with no token (T362715)]], [[gerrit:1130561|Do not start central login from the shared domain (T362715)]]
[13:33:37] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661
[13:33:37] T380758: Turn on Parsoid Fragments support everywhere - https://phabricator.wikimedia.org/T380758
[13:33:38] T389545: Variables in Parsoid content show older values compared to legacy - https://phabricator.wikimedia.org/T389545
[13:33:38] T387608: Parsoid's Fragment Mode support doesn't process strip markers recursively inside StripMarker::split - https://phabricator.wikimedia.org/T387608
[13:33:38] T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715
[13:33:56] (CR) Marostegui: [C:+2] mariadb: Promote db2215 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1130590 (https://phabricator.wikimedia.org/T389795) (owner: Gerrit maintenance bot)
[13:33:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:34:04] (CR) Arturo Borrero Gonzalez: [C:+2] Add Chuck key [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:34:14] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130593 (https://phabricator.wikimedia.org/T389796) (owner: Bartosz Dziewoński)
[13:35:09] (CR) Arturo Borrero Gonzalez: [V:+2 C:+2] Add Chuck key [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:37:04] !log tgr@deploy1003 tgr, cscott: Backport for [[gerrit:1130343|Turn on Parsoid fragment support everywhere (take 2) (T374661 T380758 T389545 T387608)]], [[gerrit:1130560|Do not throw an exception after shared-domain login with no token (T362715)]], [[gerrit:1130561|Do not start central login from the shared domain (T362715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:37:14] ok, taking a look
[13:38:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1015.eqiad.wmnet
[13:39:11] (PS1) Hashar: gerrit: raise heap limit from 32g to 64g [puppet] - https://gerrit.wikimedia.org/r/1130597
[13:39:38] (PS2) Hashar: gerrit: raise heap limit from 32g to 64g [puppet] - https://gerrit.wikimedia.org/r/1130597 (https://phabricator.wikimedia.org/T387223)
[13:39:41] (Abandoned) Slyngshede: prometheus::node_exporter: allow users to update files they own [puppet] - https://gerrit.wikimedia.org/r/978049 (owner: Jbond)
[13:40:22] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
[13:40:30] FIRING: [10x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:57] (Abandoned) Slyngshede: P:mirrors::debian Export mirror age to textfile exporter [puppet] - https://gerrit.wikimedia.org/r/1003442 (owner: Slyngshede)
[13:41:02] testing looks good so far, just a few more pages to check
[13:43:30] !log Starting x1 codfw failover from db2196 to db2215 - T389795
[13:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:34] T389795: Switchover x1 master (db2196 -> db2215) - https://phabricator.wikimedia.org/T389795
[13:43:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2215 to x1 primary T389795', diff saved to https://phabricator.wikimedia.org/P74339 and previous config saved to /var/cache/conftool/dbconfig/20250324-134356-marostegui.json
[13:45:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2196 T389795', diff saved to https://phabricator.wikimedia.org/P74340 and previous config saved to /var/cache/conftool/dbconfig/20250324-134500-marostegui.json
[13:45:25] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2196.codfw.wmnet
[13:45:30] FIRING: [10x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:46:22] tgr_: looks good
[13:46:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
[13:46:52] !log tgr@deploy1003 tgr, cscott: Continuing with sync
[13:48:22] (CR) Klausman: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1130595 (https://phabricator.wikimedia.org/T387854) (owner: Klausman)
[13:49:16] jouncebot: now and next
[13:49:16] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1300)
[13:49:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2196.codfw.wmnet
[13:49:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:50:18] (PS1) Marostegui: mariadb: Move hosts to ms3 [puppet] - https://gerrit.wikimedia.org/r/1130600 (https://phabricator.wikimedia.org/T387332)
[13:50:30] FIRING: [10x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:50:32] (CR) Filippo Giunchedi: [C:+1] "\o/ \o/" [puppet] - https://gerrit.wikimedia.org/r/1130594 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[13:50:46] godog: we'll overrun the window quite a bit
[13:50:55] but can stop here if there's something more important
[13:51:04] (CR) Slyngshede: [C:+2] P:firewall remove unused code [puppet] - https://gerrit.wikimedia.org/r/1130594 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[13:51:48] (PS1) Volans: setup.py: update prospector pin [software/spicerack] - https://gerrit.wikimedia.org/r/1130601
[13:52:18] tgr_: ack thank you, no worries no
[13:52:34] (CR) Klausman: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1130595 (https://phabricator.wikimedia.org/T387854) (owner: Klausman)
[13:52:54] (PS2) Marostegui: mariadb: Move hosts to ms2 [puppet] - https://gerrit.wikimedia.org/r/1130600 (https://phabricator.wikimedia.org/T387332)
[13:53:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:54:00] (CR) Klausman: [V:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130595 (https://phabricator.wikimedia.org/T387854) (owner: Klausman)
[13:54:09] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130343|Turn on Parsoid fragment support everywhere (take 2) (T374661 T380758 T389545 T387608)]], [[gerrit:1130560|Do not throw an exception after shared-domain login with no token (T362715)]], [[gerrit:1130561|Do not start central login from the shared domain (T362715)]] (duration: 20m 42s)
[13:54:18] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661
[13:54:18] T380758: Turn on Parsoid Fragments support everywhere - https://phabricator.wikimedia.org/T380758
[13:54:18] T389545: Variables in Parsoid content show older values compared to legacy - https://phabricator.wikimedia.org/T389545
[13:54:19] T387608: Parsoid's Fragment Mode support doesn't process strip markers recursively inside StripMarker::split - https://phabricator.wikimedia.org/T387608
[13:54:20] T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715
[13:55:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10668132 (phaultfinder)
[13:56:03] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130166 (https://phabricator.wikimedia.org/T359385) (owner: Cwhite)
[13:56:03] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130593 (https://phabricator.wikimedia.org/T389796) (owner: Bartosz Dziewoński)
[13:56:11] (CR) Ladsgroup: [C:+1] "Did we add ms2 to dbctl valid sections?" [puppet] - https://gerrit.wikimedia.org/r/1130600 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[13:56:23] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2011.codfw.wmnet with OS bookworm
[13:56:33] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-serve2011
[13:56:33] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2011
[13:56:51] (Merged) jenkins-bot: bugfix: add back missing pipe char to conform to dogstatsd spec [mediawiki-config] - https://gerrit.wikimedia.org/r/1130166 (https://phabricator.wikimedia.org/T359385) (owner: Cwhite)
[13:56:55] !log updated cumin to v5.1.1 on cumin1002
[13:56:58] (PS18) Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946)
[13:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:00] (CR) Marostegui: "yes, at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122945" [puppet] - https://gerrit.wikimedia.org/r/1130600 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[13:57:01] (CR) Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: Elukey)
[13:57:05] (Merged) jenkins-bot: Fix clearing stuck 'UserID' and 'UserName' cookies on Wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/1130593 (https://phabricator.wikimedia.org/T389796) (owner: Bartosz Dziewoński)
[13:57:11] (Abandoned) Aqu: WIP Analytics: Depecate wmf.webrequest data purge [puppet] - https://gerrit.wikimedia.org/r/1130137 (https://phabricator.wikimedia.org/T387750) (owner: Aqu)
[13:57:20] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1130166|bugfix: add back missing pipe char to conform to dogstatsd spec (T359385)]], [[gerrit:1130593|Fix clearing stuck 'UserID' and 'UserName' cookies on Wikitech (T389796)]]
[13:57:25] T359385: Migrate MediaWiki.arclamp to statslib - https://phabricator.wikimedia.org/T359385
[13:57:25] T389796: Unable to autologin on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T389796
[13:57:27] (CR) Ladsgroup: [C:+1] "<3" [puppet] - https://gerrit.wikimedia.org/r/1130600 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[13:57:44] (CR) Klausman: [V:+1 C:+2] role::ml_k8s::worker: move ml-serv2011 to containerd [puppet] - https://gerrit.wikimedia.org/r/1130595 (https://phabricator.wikimedia.org/T387854) (owner: Klausman)
[13:58:07] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10668158 (bking)
[13:59:39] FIRING: ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:59:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:59:57] (PS7) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[13:59:57] (PS3) Ayounsi: Add transit/peering in/out port saturation alert [alerts] - https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052)
[14:00:38] (Abandoned) Zoe: Re-enable creation of Flow pages for sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: Zoe)
[14:01:14] (CR) Ayounsi: Duplicate LibreNMS In/out interface errors (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[14:01:26] (CR) Elukey: [C:+1] setup.py: update prospector pin [software/spicerack] - https://gerrit.wikimedia.org/r/1130601 (owner: Volans)
[14:01:43] !log depooling wdqs1012 (catching up lag)
[14:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:51] (CR) CI reject: [V:-1] setup.py: update prospector pin [software/spicerack] - https://gerrit.wikimedia.org/r/1130601 (owner: Volans)
[14:01:54] !log tgr@deploy1003 tgr, matmarex, cwhite: Backport for [[gerrit:1130166|bugfix: add back missing pipe char to conform to dogstatsd spec (T359385)]], [[gerrit:1130593|Fix clearing stuck 'UserID' and 'UserName' cookies on Wikitech (T389796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:01:57] (CR) Ayounsi: [C:+2] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[14:02:00] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2188 slowly with 10 steps - Pool db2188.codfw.wmnet in after cloning
[14:02:03] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db2188 slowly with 10 steps - Pool db2188.codfw.wmnet in after cloning
[14:02:05] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2188.codfw.wmnet onto db2212.codfw.wmnet
[14:02:07] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5146/console" [puppet] - https://gerrit.wikimedia.org/r/1130595 (https://phabricator.wikimedia.org/T387854) (owner: Klausman)
[14:02:09] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2227.codfw.wmnet onto db2205.codfw.wmnet
[14:02:31] i can test it
[14:03:10] Mine is ready to go. Thank you!
[14:03:12] (Merged) jenkins-bot: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[14:04:36] hmm, i'm not seeing the expected behavior
[14:04:45] (CR) Volans: "replies inline" [cookbooks] - https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: Elukey)
[14:05:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Index rebuild
[14:06:02] tgr_: the fix in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1130593 is not working for me
[14:06:23] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2188 gradually with 4 steps - Pooling in after cloning
[14:06:26] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2188 gradually with 4 steps - Pooling in after cloning
[14:06:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74341 and previous config saved to /var/cache/conftool/dbconfig/20250324-140633-root.json
[14:06:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2227 slowly with 10 steps - Pooling in after cloning
[14:06:56] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2227 slowly with 10 steps - Pooling in after cloning
[14:07:00] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/1130601 (owner: Volans)
[14:07:04] i don't think we have to revert it, but i guess i'll need to keep debugging to understand why
[14:07:12] ack, thanks
[14:07:15] !log tgr@deploy1003 tgr, matmarex, cwhite: Continuing with sync
[14:08:40] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[14:09:42] tgr_: oh… the default value before dynamic defaults is `false`, not `null` 🤦‍♂️
[14:10:34] MatmaRex: do you want to try to fix? there's one more patch to deploy anyway
[14:10:43] (CR) Ayounsi: [C:+2] Add transit/peering in/out port saturation alert [alerts] - https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: Ayounsi)
[14:11:45] (PS1) Bartosz Dziewoński: Fix clearing stuck cookies: $wgCookiePrefix defaults to false, not null [mediawiki-config] - https://gerrit.wikimedia.org/r/1130607 (https://phabricator.wikimedia.org/T389796)
[14:11:46] tgr_: yeah ^
[14:12:04] (Merged) jenkins-bot: Add transit/peering in/out port saturation alert [alerts] - https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: Ayounsi)
[14:12:25] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2011.codfw.wmnet with reason: host reimage
[14:12:50] MatmaRex: can you add an extra parenthesis there?
[14:12:52] (PS1) Ayounsi: Revert "Add transit/peering in/out port saturation alert" [alerts] - https://gerrit.wikimedia.org/r/1130608
[14:13:14] I'm sure it works but more mental effort to read
[14:13:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002"
[14:13:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002"
[14:13:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:14:37] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130166|bugfix: add back missing pipe char to conform to dogstatsd spec (T359385)]], [[gerrit:1130593|Fix clearing stuck 'UserID' and 'UserName' cookies on Wikitech (T389796)]] (duration: 17m 16s)
[14:14:42] T359385: Migrate MediaWiki.arclamp to statslib - https://phabricator.wikimedia.org/T359385
[14:14:43] T389796: Unable to autologin on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T389796
[14:15:12] (CR) Ayounsi: [C:+2] Revert "Add transit/peering in/out port saturation alert" [alerts] - https://gerrit.wikimedia.org/r/1130608 (owner: Ayounsi)
[14:15:52] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2011.codfw.wmnet with reason: host reimage
[14:16:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130607 (https://phabricator.wikimedia.org/T389796) (owner: Bartosz Dziewoński)
[14:16:18] (PS1) Btullis: Remove Hadoop worker specific disk checks [alerts] - https://gerrit.wikimedia.org/r/1130609 (https://phabricator.wikimedia.org/T389466)
[14:16:26] (Merged) jenkins-bot: Revert "Add transit/peering in/out port saturation alert" [alerts] - https://gerrit.wikimedia.org/r/1130608 (owner: Ayounsi)
[14:16:29] (PS1) Ssingh: admin: add bd808 to release-engineering [puppet] - https://gerrit.wikimedia.org/r/1130610 (https://phabricator.wikimedia.org/T389699)
[14:17:38] tgr_: yeah sorry, i looked away for a minute
[14:18:06] (PS2) Bartosz Dziewoński: Fix clearing stuck cookies: $wgCookiePrefix defaults to false, not null [mediawiki-config] - https://gerrit.wikimedia.org/r/1130607 (https://phabricator.wikimedia.org/T389796)
[14:18:39] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130319 (https://phabricator.wikimedia.org/T362715) (owner: Gergő Tisza)
[14:18:39] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130607 (https://phabricator.wikimedia.org/T389796) (owner: Bartosz Dziewoński)
[14:18:55] (CR) Ssingh: "Self-merging as part of clinic duty; approval on task T389699." [puppet] - https://gerrit.wikimedia.org/r/1130610 (https://phabricator.wikimedia.org/T389699) (owner: Ssingh)
[14:19:29] (CR) Ssingh: [C:+2] admin: add bd808 to release-engineering [puppet] - https://gerrit.wikimedia.org/r/1130610 (https://phabricator.wikimedia.org/T389699) (owner: Ssingh)
[14:19:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:20:30] FIRING: [6x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:21:23] (Merged) jenkins-bot: Fix clearing stuck cookies: $wgCookiePrefix defaults to false, not null [mediawiki-config] - https://gerrit.wikimedia.org/r/1130607 (https://phabricator.wikimedia.org/T389796) (owner: Bartosz Dziewoński)
[14:21:29] (PS1) Muehlenhoff: postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565)
[14:21:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P74342 and previous config saved to /var/cache/conftool/dbconfig/20250324-142139-root.json
[14:22:13] (CR) CI reject: [V:-1] postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[14:22:53] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to `releng` & `gerritadmin` for bd808 - https://phabricator.wikimedia.org/T389699#10668351 (ssingh) Open→Resolved a:ssingh Added to the two LDAP groups and also to `data.yaml` in Puppet as well (`release-engineering`). Please...
[14:23:51] (CR) Volans: [C:+2] setup.py: update prospector pin [software/spicerack] - https://gerrit.wikimedia.org/r/1130601 (owner: Volans)
[14:23:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:24:32] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389537#10668384 (Jhancock.wm) Open→Resolved a:Jhancock.wm reseated psu1 and alert cleared from server. cleared from Prometheus.
[14:25:08] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10668390 (bking) a:bking→None
[14:25:30] FIRING: [6x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130345 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[14:28:05] (PS8) Abijeet Patro: AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886)
[14:28:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: Abijeet Patro)
[14:28:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:30:01] (CR) Muehlenhoff: [C:+2] Remove access for aitolkyn [puppet] - https://gerrit.wikimedia.org/r/1130562 (owner: Muehlenhoff)
[14:30:30] RESOLVED: [6x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:31:29] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2011.codfw.wmnet with OS bookworm
[14:31:36] (Merged) jenkins-bot: Redirect credentials change pages to central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130319 (https://phabricator.wikimedia.org/T362715) (owner: Gergő Tisza)
[14:31:52] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1130319|Redirect credentials change pages to central domain (T362715)]], [[gerrit:1130607|Fix clearing stuck cookies: $wgCookiePrefix defaults to false, not null (T389796)]]
[14:31:57] T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715
[14:31:58] T389796: Unable to autologin on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T389796
[14:32:04] (PS12) Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231)
[14:32:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:33:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:34:16] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Aitolkyn out of all services on: 1310 hosts
[14:34:41] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1008
[14:34:43] (PS1) Cwhite: beta-logs: fix puppet failure on collector hosts [puppet] - https://gerrit.wikimedia.org/r/1130615 (https://phabricator.wikimedia.org/T384335)
[14:34:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:34:58] (PS2) Muehlenhoff: postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565)
[14:35:36] (Merged) jenkins-bot: setup.py: update prospector pin [software/spicerack] - https://gerrit.wikimedia.org/r/1130601 (owner: Volans)
[14:35:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1008
[14:35:49] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Aitolkyn out of all services on: 953 hosts
[14:36:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74344 and previous
config saved to /var/cache/conftool/dbconfig/20250324-143644-root.json [14:36:49] !log tgr@deploy1003 matmarex, tgr: Backport for [[gerrit:1130319|Redirect credentials change pages to central domain (T362715)]], [[gerrit:1130607|Fix clearing stuck cookies: $wgCookiePrefix defaults to false, not null (T389796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:51] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-ctrl2005 to codfw - jhancock@cumin2002" [14:36:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-ctrl2005 to codfw - jhancock@cumin2002" [14:36:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:36:58] alright, here goes [14:37:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2005 [14:37:30] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1009 [14:37:30] tgr_: works perfectly now [14:37:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2005 [14:38:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:38:26] yay! 
[14:38:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1009 [14:38:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:38:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:39:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:39:14] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1010 [14:39:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1010 [14:39:29] (03CR) 10Cwhite: [C:03+2] beta-logs: fix puppet failure on collector hosts [puppet] - 10https://gerrit.wikimedia.org/r/1130615 (https://phabricator.wikimedia.org/T384335) (owner: 10Cwhite) [14:41:23] (03PS19) 10Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [14:41:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:41:39] (03CR) 10Elukey: "Need to check the new version with test-cookbook :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [14:42:01] (03PS1) 10Btullis: Remove the cloudnativepg backup export checks [alerts] - 10https://gerrit.wikimedia.org/r/1130619 (https://phabricator.wikimedia.org/T389466) 
[14:42:16] (i'm away for a bit) [14:43:04] !log joal@deploy1003 Started deploy [analytics/refinery@e0320e1]: Regular analytics weekly train [analytics/refinery@e0320e14] [14:43:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2254.codfw.wmnet with OS bookworm [14:43:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10668543 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2254.codfw.wmnet with... [14:43:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:44:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2240 with weight 0 T389378', diff saved to https://phabricator.wikimedia.org/P74346 and previous config saved to /var/cache/conftool/dbconfig/20250324-144410-marostegui.json [14:44:14] T389378: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T389378 [14:44:17] !log tgr@deploy1003 matmarex, tgr: Continuing with sync [14:44:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:44:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T389378 [14:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2240 from API/vslow/dump T389378', diff saved to https://phabricator.wikimedia.org/P74347 and previous config saved to /var/cache/conftool/dbconfig/20250324-144440-marostegui.json [14:45:02] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1129301 (https://phabricator.wikimedia.org/T389378) (owner: 10Gerrit maintenance bot) [14:45:34] (03PS1) 
10Bking: cloudelastic: DO NOT MERGE (just for PCC) [puppet] - 10https://gerrit.wikimedia.org/r/1130624 (https://phabricator.wikimedia.org/T383811) [14:45:47] !log joal@deploy1003 Finished deploy [analytics/refinery@e0320e1]: Regular analytics weekly train [analytics/refinery@e0320e14] (duration: 02m 42s) [14:45:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130624 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [14:46:14] !log joal@deploy1003 Started deploy [analytics/refinery@e0320e1] (thin): Regular analytics weekly train THIN [analytics/refinery@e0320e14] [14:46:18] (03PS1) 10Ayounsi: Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) [14:46:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:47:05] !log joal@deploy1003 Finished deploy [analytics/refinery@e0320e1] (thin): Regular analytics weekly train THIN [analytics/refinery@e0320e14] (duration: 00m 50s) [14:48:11] (03PS1) 10Scott French: php8.1: rebuild to pick up new php and php-excimer packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) [14:48:28] !log joal@deploy1003 Started deploy [analytics/refinery@e0320e1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@e0320e14] [14:48:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:48:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:49:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:49:04] !log joal@deploy1003 Finished deploy [analytics/refinery@e0320e1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@e0320e14] (duration: 00m 36s) [14:49:40] RESOLVED: ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:49:54] !log Starting s4 codfw failover from db2179 to db2240 - T389378 [14:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:58] T389378: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T389378 [14:50:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2240 to s4 primary T389378', diff saved to https://phabricator.wikimedia.org/P74348 and previous config saved to /var/cache/conftool/dbconfig/20250324-145018-marostegui.json [14:50:34] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[14:51:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2179 T389378', diff saved to https://phabricator.wikimedia.org/P74349 and previous config saved to /var/cache/conftool/dbconfig/20250324-145100-marostegui.json [14:51:31] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130319|Redirect credentials change pages to central domain (T362715)]], [[gerrit:1130607|Fix clearing stuck cookies: $wgCookiePrefix defaults to false, not null (T389796)]] (duration: 19m 38s) [14:51:36] T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715 [14:51:36] T389796: Unable to autologin on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T389796 [14:51:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:51:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74350 and previous config saved to /var/cache/conftool/dbconfig/20250324-145150-root.json [14:52:10] (03PS20) 10Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [14:52:26] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:52:28] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:52:36] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2179.codfw.wmnet [14:52:54] !log UTC afternoon deploys done [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:18] (03PS21) 10Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 
(https://phabricator.wikimedia.org/T386946) [14:53:19] (03CR) 10Kamila Součková: [C:03+1] php8.1: rebuild to pick up new php and php-excimer packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) (owner: 10Scott French) [14:53:19] tgr_: thanks! [14:53:35] sure [14:53:36] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:56:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [14:57:15] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [14:57:20] (03CR) 10Clément Goubert: [C:03+1] php8.1: rebuild to pick up new php and php-excimer packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) (owner: 10Scott French) [14:58:51] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:59:27] (03PS26) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [14:59:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2179.codfw.wmnet [15:00:05] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Index rebuild [15:00:42] (03CR) 10Elukey: [C:03+1] "Left a nit for a comment, please proceed afterwards!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:01:24] !incidents [15:01:25] 5780 (UNACKED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [15:01:25] 5779 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:01:25] (03CR) 10Dzahn: [C:03+1] gerrit: raise heap limit from 32g to 64g [puppet] - 10https://gerrit.wikimedia.org/r/1130597 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar) [15:01:25] 5778 (RESOLVED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [15:01:27] !incidents [15:01:28] 5780 (UNACKED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [15:01:28] 5779 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:01:28] 5778 (RESOLVED) Manual (paged) by Rae Adimer (radimer@wikimedia.org): MediaWiki internal error on loading Wikipedia [15:01:37] !ack db2179 [15:01:37] the value db2179 doesn't match the regexp ^\d+$ [15:01:38] Incident id must be an integer [15:01:46] !ack 5780 [15:01:47] 5780 (ACKED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [15:02:23] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817 (10Chuckonwumelu) 03NEW [15:02:34] I downtimed it :( [15:03:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:03:20] !resolve 5780 [15:03:20] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [15:03:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[15:03:35] (03PS1) 10Gehel: style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 [15:03:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [15:03:57] (03CR) 10CI reject: [V:04-1] style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 (owner: 10Gehel) [15:03:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:04:08] I see the downtime in SAL. And yes there is maintenance on db2179 so I'll leave that to the data-persistence team [15:04:57] jelto: I downtimed it earlier but it yet paged, meh [15:05:38] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10668768 (10RobH) Replacement optics and fiber will arrive on March 26th by end of day. Not great delivery times considering I chose International Priority DHL. 
[15:05:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10668769 (10phaultfinder) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74351 and previous config saved to /var/cache/conftool/dbconfig/20250324-150655-root.json [15:08:25] FIRING: SystemdUnitFailed: opensearch_1@cloudelastic-eqiad.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10668793 (10phaultfinder) [15:13:32] (03PS1) 10Ayounsi: Promote some network alerts from warning to critical [alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) [15:14:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130596 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [15:14:17] (03CR) 10Jelto: [C:03+1] "lgtm, let me know if you need a second pair of eyes for admin_ng deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [15:15:22] (03PS1) 10Dzahn: add hiera keys needed since spiderpig includes envoy on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/1130633 [15:15:22] 
(03CR) 10Elukey: "Hugh feel free to -2, this is an idea to keep in mind that came up while investigating the rate_limit cluster's 504s." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 (owner: 10Elukey) [15:15:45] (03CR) 10CI reject: [V:04-1] add hiera keys needed since spiderpig includes envoy on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/1130633 (owner: 10Dzahn) [15:16:02] (03CR) 10Dzahn: "added in Hiera on Horizon by dancy. syncing to repo." [puppet] - 10https://gerrit.wikimedia.org/r/1130633 (owner: 10Dzahn) [15:16:40] (03CR) 10Krinkle: [C:04-1] "Per Filippo: The queries sublists are useful to a certain use case, so I propose we remove the trimmed version and instead under a flag ad" [software] - 10https://gerrit.wikimedia.org/r/1129242 (owner: 10Filippo Giunchedi) [15:17:28] (03PS2) 10Dzahn: add hiera keys needed since spiderpig includes envoy [puppet] - 10https://gerrit.wikimedia.org/r/1130633 [15:17:53] (03PS2) 10Gehel: style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 [15:18:00] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130631 (owner: 10Gehel) [15:18:25] RESOLVED: SystemdUnitFailed: opensearch_1@cloudelastic-eqiad.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:18:45] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:20:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:20:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10668870 (10phaultfinder) [15:22:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74352 and previous config saved to /var/cache/conftool/dbconfig/20250324-152201-root.json [15:23:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:23:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:24:23] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:24:38] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:25:16] (03PS1) 10Clément Goubert: thumbor: Backport maxUnavailable from production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130636 [15:25:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:26:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2005.codfw.wmnet with OS bookworm [15:27:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: 
Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10668925 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-ctrl2005.codfw.wmnet with O... [15:28:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2254 to codfw - jhancock@cumin2002" [15:28:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2254 to codfw - jhancock@cumin2002" [15:28:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:51] (03CR) 10Brouberol: [C:03+1] Remove the cloudnativepg backup export checks [alerts] - 10https://gerrit.wikimedia.org/r/1130619 (https://phabricator.wikimedia.org/T389466) (owner: 10Btullis) [15:28:58] (03CR) 10Brouberol: [C:03+1] Remove Hadoop worker specific disk checks [alerts] - 10https://gerrit.wikimedia.org/r/1130609 (https://phabricator.wikimedia.org/T389466) (owner: 10Btullis) [15:29:05] (03CR) 10DCausse: [C:04-1] "not ready yet, some components are still writing to the rc0 weighted_tags stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [15:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1530). 
[15:30:43] (03PS2) 10Clément Goubert: thumbor: Backport maxUnavailable from production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130636 [15:31:14] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10669001 (10RobH) Updated the open support case: > Support, > > Since updating the firmware and returning to service, the PCIe bus is throwing errors on using the two NVMe PCI SSDs: > > sudo cat /et... [15:32:43] (03PS3) 10Clément Goubert: thumbor: Fix UnavailableReplicas alert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130636 [15:32:46] (03PS2) 10Scott French: php8.1: rebuild to pick up new php and php-excimer packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) [15:32:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Maintenance in x2 [15:33:36] (03CR) 10Scott French: "Thank you both for the reviews! So, I've modified the versioning a bit to match the ongoing discussion, particularly to reset to the under" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) (owner: 10Scott French) [15:33:47] (03CR) 10Marostegui: [C:03+2] mariadb: Move hosts to ms2 [puppet] - 10https://gerrit.wikimedia.org/r/1130600 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [15:34:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:34:22] (03CR) 10Scott French: [C:03+1] thumbor: Fix UnavailableReplicas alert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130636 (owner: 10Clément Goubert) [15:34:52] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10669019 (10aborrero) 05Open→03In progress a:03joanna_borun Assigning to Joanna for approval. 
[15:35:43] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10669028 (10Chuckonwumelu) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:01] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10669042 (10Chuckonwumelu) [15:37:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74353 and previous config saved to /var/cache/conftool/dbconfig/20250324-153706-root.json [15:37:32] (03CR) 10Clément Goubert: [C:03+1] "LGTM as long as our build scripts don't get confused." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) (owner: 10Scott French) [15:37:34] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10669055 (10Chuckonwumelu) [15:37:41] (03PS1) 10Marostegui: db1151,db2144: Make them masters in ms2 [puppet] - 10https://gerrit.wikimedia.org/r/1130639 (https://phabricator.wikimedia.org/T387332) [15:38:16] (03CR) 10Marostegui: [C:03+2] db1151,db2144: Make them masters in ms2 [puppet] - 10https://gerrit.wikimedia.org/r/1130639 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [15:38:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: host reimage [15:38:47] (03CR) 10Clément Goubert: [C:03+2] thumbor: Fix UnavailableReplicas alert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130636 (owner: 10Clément Goubert) [15:39:33] !log jhancock@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2254 to codfw - jhancock@cumin2002" [15:39:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2254 to codfw - jhancock@cumin2002" [15:39:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:01] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2254 [15:40:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2254 [15:40:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2255 [15:40:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2255 [15:41:00] (03PS3) 10Muehlenhoff: postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) [15:41:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:15] (03CR) 10Muehlenhoff: postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:41:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for 
host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:42:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: host reimage [15:43:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:44:05] (03PS1) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) [15:44:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:44:35] (03Merged) 10jenkins-bot: thumbor: Fix UnavailableReplicas alert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130636 (owner: 10Clément Goubert) [15:44:59] !log Deploying pending admin_ng changes to all clusters [15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:21] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:45:25] (03CR) 10CI reject: [V:04-1] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [15:45:55] !log reprepro include php-excimer 1.2.3-1+wmf11u1 in component/php81 - T389243 [15:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:00] T389243: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243 [15:46:43] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:46:51] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:47:57] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [15:49:41] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:49:57] !log cgoubert@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:51:09] !log cgoubert@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:51:19] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:51:37] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:51:43] !log cgoubert@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [15:52:04] (03CR) 10Elukey: [C:03+1] postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:52:18] !log reprepro include php8.1 8.1.32-1+wmf11u1 in component/php81 [15:52:20] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:26] jouncebot: nowandnext [15:53:26] For the next 0 hour(s) and 6 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1530) [15:53:26] In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1700) [15:53:26] In 1 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1700) [15:53:54] (03PS1) 10Reedy: populateLocalAndGlobalIds.php: Fix rows with lu_local_id=0 or lu_global_id=0/null [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130643 (https://phabricator.wikimedia.org/T303590) [15:54:26] (03PS1) 
10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [15:55:14] (03CR) 10Reedy: [C:03+2] populateLocalAndGlobalIds.php: Fix rows with lu_local_id=0 or lu_global_id=0/null [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130643 (https://phabricator.wikimedia.org/T303590) (owner: 10Reedy) [15:55:39] (03CR) 10Federico Ceratto: "Add "standard" helpers to the cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:55:56] (03PS2) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) [15:56:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add ms2 to live traffic T387332', diff saved to https://phabricator.wikimedia.org/P74354 and previous config saved to /var/cache/conftool/dbconfig/20250324-155616-marostegui.json [15:56:21] T387332: Set up ms1, ms2, ms3 db clusters - https://phabricator.wikimedia.org/T387332 [15:56:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:56:28] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:56:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:11] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[15:57:21] (03PS3) 10Gehel: style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 [15:57:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:57:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:58:00] (03PS1) 10Marostegui: ms3 hosts: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130645 [15:58:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:58:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2005.codfw.wmnet with OS bookworm [15:58:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10669175 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-ctrl2005.codfw.wmnet with OS bo... [15:58:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:58:28] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:58:35] (03CR) 10Marostegui: [C:03+2] ms3 hosts: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130645 (owner: 10Marostegui) [15:58:56] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:59:21] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [16:00:45] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:00:52] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[16:01:08] (PS1) Bartosz Dziewoński: Fully silence TRX profiler after autocreation [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130648 (https://phabricator.wikimedia.org/T388165) [16:01:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:01:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1130648 (https://phabricator.wikimedia.org/T388165) (owner: Bartosz Dziewoński) [16:01:36] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:01:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:01:57] !log cgoubert@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [16:02:04] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:02:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:02:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:02:46] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:03:11] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:03:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2254.codfw.wmnet with OS bookworm [16:03:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2255.codfw.wmnet with OS bookworm [16:03:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2256.codfw.wmnet with OS bookworm [16:03:49] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:03:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10669258 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2254.codfw.wmnet with... [16:03:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10669259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with... [16:03:55] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10669260 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2256.codfw.wmnet with... 
[16:04:11] (03Merged) 10jenkins-bot: populateLocalAndGlobalIds.php: Fix rows with lu_local_id=0 or lu_global_id=0/null [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130643 (https://phabricator.wikimedia.org/T303590) (owner: 10Reedy) [16:07:36] 06SRE, 06Infrastructure-Foundations: /etc/wikimedia/logout.d/50-systemdlogoutd sometimes fails to terminate user session on stat hosts - https://phabricator.wikimedia.org/T389324#10669280 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:07:52] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add Puppet fact to determine the boot method - https://phabricator.wikimedia.org/T389217#10669287 (10MoritzMuehlenhoff) p:05Triage→03Low [16:08:24] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add Puppet fact to determine the boot method - https://phabricator.wikimedia.org/T389217#10669290 (10elukey) [16:08:45] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:08:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:09:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [16:10:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [16:10:53] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10669309 (10Jhancock.wm) [16:11:40] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:12:10] !log elukey@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:17:45] 06SRE, 10WMF-General-or-Unknown, 07Performance Issue, 07Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10669432 (10matmarex) Logs show increased error rate for this exception starting about a week a... [16:18:37] 06SRE, 06Infrastructure-Foundations: Make choice of firewall stack in insetup roles specific / Add nftables variants - https://phabricator.wikimedia.org/T389825 (10MoritzMuehlenhoff) 03NEW [16:19:49] (03CR) 10Btullis: [C:03+2] Remove Hadoop worker specific disk checks [alerts] - 10https://gerrit.wikimedia.org/r/1130609 (https://phabricator.wikimedia.org/T389466) (owner: 10Btullis) [16:19:59] (03CR) 10Btullis: [C:03+2] Remove the cloudnativepg backup export checks [alerts] - 10https://gerrit.wikimedia.org/r/1130619 (https://phabricator.wikimedia.org/T389466) (owner: 10Btullis) [16:21:05] (03Merged) 10jenkins-bot: Remove Hadoop worker specific disk checks [alerts] - 10https://gerrit.wikimedia.org/r/1130609 (https://phabricator.wikimedia.org/T389466) (owner: 10Btullis) [16:21:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:21:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:21:14] (03Merged) 10jenkins-bot: Remove the cloudnativepg backup export checks [alerts] - 10https://gerrit.wikimedia.org/r/1130619 (https://phabricator.wikimedia.org/T389466) (owner: 10Btullis) [16:21:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:37] (CR) Btullis: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: Brouberol) [16:22:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:23:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:23:46] (CR) Btullis: [C:+1] Remove docker related referrences on dse-k8s worker and master [puppet] - https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene) [16:24:40] (CR) Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: Brouberol) [16:26:13] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. 
[16:26:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10669510 (10Jhancock.wm) [16:28:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2291 to codfw - jhancock@cumin2002" [16:28:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2291 to codfw - jhancock@cumin2002" [16:28:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:17] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2291 [16:28:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2291 [16:29:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:29:22] (03CR) 10Btullis: [C:03+1] "Thanks, this looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1128406 (owner: 10Muehlenhoff) [16:29:26] (03CR) 10Dzahn: [C:03+2] add hiera keys needed since spiderpig includes envoy [puppet] - 10https://gerrit.wikimedia.org/r/1130633 (owner: 10Dzahn) [16:29:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:30:15] (03CR) 10Ayounsi: "Looking at history they have been working as expected." 
[alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [16:30:44] (03PS1) 10Bernard Wang: Disable Search AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 [16:31:27] (03PS4) 10Clément Goubert: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [16:32:23] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [16:33:45] (03PS2) 10Bernard Wang: Disable Search AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) [16:33:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [16:34:33] (03CR) 10CI reject: [V:04-1] mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [16:34:37] (03CR) 10Dzahn: [C:03+1] "thank you, Jelto. 
I was going to follow https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Add_a_Kubernetes_namespace does t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [16:36:52] (03CR) 10AikoChou: [C:03+1] ml-services: update rrla staging image and env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130579 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [16:37:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [16:38:47] !log reedy@deploy1003 Synchronized php-1.44.0-wmf.21/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: T303590 (duration: 11m 51s) [16:38:51] T303590: Fix localuser rows with lu_local_id=0 or lu_global_id=0 - https://phabricator.wikimedia.org/T303590 [16:42:49] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10669644 (10Gehel) [16:43:20] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10669647 (10Gehel) We're not expecting the Search Platform team to do any work on t... [16:45:30] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 4 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10669663 (10simon04) Let's try to finish this on the #wikimedia-hackathon-2025 then... 
[16:45:46] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [16:47:57] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10669691 (10RobH) They've requested an updated TSR log from the idrac, so submitted that via the portal and a followup email: Uploaded TSR20250320194318_9VBNPR3.zip to the portal where I uploaded the fir... [16:48:59] (03CR) 10Elukey: sre.hosts.provision: try Supermicro BMC passwords automatically (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [16:49:46] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:52:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [16:53:12] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 4 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10669816 (10Clement_Goubert) >>! In T318285#10669663, @simon04 wrote: > Let's try t... 
[16:54:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.318s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:54:32] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10669822 (10simon04) Even better, thank you! Sorry for the noise! [16:55:10] !log btullis@cumin1002 START - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash rolling restart_daemons on A:apifeatureusage [16:57:48] !log btullis@cumin1002 END (PASS) - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash (exit_code=0) rolling restart_daemons on A:apifeatureusage [16:57:56] (03PS4) 10Gehel: style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 [16:58:05] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130631 (owner: 10Gehel) [16:59:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.318s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:59:37] (03PS1) 10Hashar: gerrit: enable pushing notifications to browsers [puppet] - 10https://gerrit.wikimedia.org/r/1130656 (https://phabricator.wikimedia.org/T389327) [16:59:49] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2212 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74355 and previous config saved to /var/cache/conftool/dbconfig/20250324-165949-root.json [17:00:04] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1700). [17:00:04] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T1700). [17:00:09] o/ [17:00:19] I'll get started on this shortly [17:00:26] (03CR) 10Btullis: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [17:00:39] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:01:51] (03CR) 10Scott French: [V:03+2] "Built locally and verified to contain the expected packages." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) (owner: 10Scott French) [17:01:56] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: rebuild to pick up new php and php-excimer packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1130626 (https://phabricator.wikimedia.org/T389243) (owner: 10Scott French) [17:02:29] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10669902 (10Clement_Goubert) We may do it at the end of that window or right after... 
[17:02:35] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:07:56] (03CR) 10Ssingh: "Looks good, one blocker I think and one question." [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [17:08:27] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:08:41] !log rebuilt php8.1 production image suite (8.1.32-1-s1) - T389243 [17:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:45] T389243: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243 [17:09:58] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up new php8.1 production image - T389243 [17:10:46] (03CR) 10Ssingh: sre.network.cf: log if no changes were made (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [17:13:50] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [17:14:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74356 and previous config saved to /var/cache/conftool/dbconfig/20250324-171454-root.json [17:14:58] (03PS1) 10Zoe: Don't clobber error information for failed Flow creates [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130658 (https://phabricator.wikimedia.org/T380911) [17:16:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130658 
(https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [17:16:33] (03PS2) 10DLynch: Enable VisualEditor EditCheck multi-check a/b test on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128921 (https://phabricator.wikimedia.org/T384372) [17:16:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10669984 (10phaultfinder) [17:16:52] (03PS1) 10Daimona Eaytoy: Drop unused $wgCampaignEventsSeparateOngoingEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130659 (https://phabricator.wikimedia.org/T386428) [17:17:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130659 (https://phabricator.wikimedia.org/T386428) (owner: 10Daimona Eaytoy) [17:17:48] (03CR) 10Volans: sre.network.cf: log if no changes were made (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [17:18:37] !log swfrench@deploy1003 swfrench: Deployment to pick up new php8.1 production image - T389243 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:18:41] T389243: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243 [17:19:19] (03PS1) 10BCornwall: upgrade cp7001 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130660 (https://phabricator.wikimedia.org/T378737) [17:19:20] (03PS1) 10BCornwall: upgrade cp7002 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130661 (https://phabricator.wikimedia.org/T378737) [17:19:21] (03PS1) 10BCornwall: upgrade cp7003 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130662 (https://phabricator.wikimedia.org/T378737) [17:19:23] (03PS1) 10BCornwall: upgrade cp7004 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130663 (https://phabricator.wikimedia.org/T378737) [17:19:24] (03PS1) 
10BCornwall: upgrade cp7005 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130664 (https://phabricator.wikimedia.org/T378737) [17:19:26] (03PS1) 10BCornwall: upgrade cp7006 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130665 (https://phabricator.wikimedia.org/T378737) [17:19:27] (03PS1) 10BCornwall: upgrade cp7007 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130666 (https://phabricator.wikimedia.org/T378737) [17:19:29] (03PS1) 10BCornwall: upgrade cp7008 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130667 (https://phabricator.wikimedia.org/T378737) [17:19:33] (03PS1) 10BCornwall: upgrade cp7009 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130668 (https://phabricator.wikimedia.org/T378737) [17:19:38] (03PS1) 10BCornwall: upgrade cp7010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130669 (https://phabricator.wikimedia.org/T378737) [17:19:42] (03PS1) 10BCornwall: upgrade cp7011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130670 (https://phabricator.wikimedia.org/T378737) [17:19:46] (03PS1) 10BCornwall: upgrade cp7012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130671 (https://phabricator.wikimedia.org/T378737) [17:19:52] (03PS1) 10BCornwall: upgrade cp7013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130672 (https://phabricator.wikimedia.org/T378737) [17:19:56] (03PS1) 10BCornwall: upgrade cp7014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130673 (https://phabricator.wikimedia.org/T378737) [17:20:00] (03PS1) 10BCornwall: upgrade cp7015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130674 (https://phabricator.wikimedia.org/T378737) [17:20:04] (03PS1) 10BCornwall: upgrade cp7016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130675 (https://phabricator.wikimedia.org/T378737) [17:20:05] sorry for the spam [17:21:19] (03CR) 10Ssingh: [C:03+1] upgrade cp7001 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1130660 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:25] (03CR) 10Ssingh: [C:03+1] upgrade cp7002 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130661 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:31] (03CR) 10Ssingh: [C:03+1] upgrade cp7003 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130662 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:34] (03CR) 10Ssingh: [C:03+1] upgrade cp7004 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130663 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:41] (03CR) 10Ssingh: [C:03+1] upgrade cp7005 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130664 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:44] (03CR) 10Ssingh: [C:03+1] upgrade cp7006 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130665 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:51] (03CR) 10Ssingh: [C:03+1] upgrade cp7007 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130666 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:21:57] (03CR) 10Ssingh: [C:03+1] upgrade cp7008 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130667 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:01] (03CR) 10Ssingh: [C:03+1] upgrade cp7009 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130668 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:07] (03CR) 10Ssingh: [C:03+1] upgrade cp7010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130669 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:14] (03CR) 10Ssingh: [C:03+1] upgrade cp7011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130670 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:21] (03CR) 10Ssingh: [C:03+1] upgrade cp7012 
to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130671 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:29] (03CR) 10Ssingh: [C:03+1] upgrade cp7013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130672 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:32] (03CR) 10Ssingh: [C:03+1] upgrade cp7014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130673 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:35] (03CR) 10Ssingh: [C:03+1] upgrade cp7015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130674 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:22:38] (03CR) 10Ssingh: [C:03+1] upgrade cp7016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130675 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:23:39] !log swfrench@deploy1003 swfrench: Continuing with sync [17:24:23] (03CR) 10BCornwall: [C:03+2] upgrade cp7001 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130660 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:24:40] (03PS1) 10Volans: context managers: combine them when feasible [cookbooks] - 10https://gerrit.wikimedia.org/r/1130676 [17:26:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670099 (10phaultfinder) [17:26:44] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:26:56] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7001.magru.wmnet} and A:cp [17:27:19] (03CR) 10Pppery: "Un-CCing myself since I don't need to get emails about the progress of the backport - I'm watching this closely enough already." 
[extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130658 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [17:30:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74357 and previous config saved to /var/cache/conftool/dbconfig/20250324-173000-root.json [17:30:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updare frack node to use new mgmt subnet 10.195.1.1/25 - pt1979@cumin2002" [17:30:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updare frack node to use new mgmt subnet 10.195.1.1/25 - pt1979@cumin2002" [17:30:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:45] 10ops-codfw, 06SRE, 06DC-Ops: Renumber frack server mgmt IPs in codfw - https://phabricator.wikimedia.org/T371468#10670156 (10Papaul) All the nodes are now on the 10.195.1.1/25 network, but not bast2002. 
I will work on bast2002 sometime tomorrow [17:32:47] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7001.magru.wmnet} and A:cp [17:33:22] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up new php8.1 production image - T389243 (duration: 23m 54s) [17:33:26] T389243: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243 [17:35:20] (03PS3) 10Esanders: VE: Disable upcoming mobile insert menu everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 [17:35:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:51] inflatador: ^ known? [17:36:03] that should be everything from me for the infra window today [17:43:02] (03PS1) 10BCornwall: sre.cdn.roll-upgrade-varnish: Install varnishkafka [cookbooks] - 10https://gerrit.wikimedia.org/r/1130681 [17:43:41] (03PS2) 10BCornwall: sre.cdn.roll-upgrade-varnish: Install varnishkafka [cookbooks] - 10https://gerrit.wikimedia.org/r/1130681 [17:44:14] (03CR) 10Ssingh: [C:03+1] sre.cdn.roll-upgrade-varnish: Install varnishkafka [cookbooks] - 10https://gerrit.wikimedia.org/r/1130681 (owner: 10BCornwall) [17:44:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [17:44:43] hmmm [17:45:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: 
Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74358 and previous config saved to /var/cache/conftool/dbconfig/20250324-174505-root.json [17:45:18] 🎵 fun times are coming 🎵 [17:46:24] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843 (10Mstyles) 03NEW [17:50:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670249 (10phaultfinder) [17:51:01] !log T379002 Start reindex of cebwiki search indices in cloudelastic [17:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:05] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002 [17:52:59] (03CR) 10BCornwall: [V:03+2 C:03+2] sre.cdn.roll-upgrade-varnish: Install varnishkafka [cookbooks] - 10https://gerrit.wikimedia.org/r/1130681 (owner: 10BCornwall) [17:53:02] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670269 (10Dzahn) Hi, to clarify: do you mean access to the web UI or do you mean shell access to logstash servers? The former should be granted by the "wmf" LDAP group and y... [17:54:24] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670273 (10Mstyles) @Dzahn when I try to login to the logstash UI, I'm getting a service access denied due to missing privileges [17:54:53] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670293 (10Mstyles) I don't need shell access to the servers, so I might have requested the wrong access? 
But either way, I still don't have access to the UI [17:56:43] " Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException"" [17:57:06] Full page error, not even Wikimedia one [17:57:08] "Exception caught inside exception handler." [17:57:18] "Original exception: [54642001-47c0-48bd-8078-a052ffeaa6cf] 2025-03-24 17:56:30" [17:57:19] :O [17:59:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670335 (10phaultfinder) [18:00:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74359 and previous config saved to /var/cache/conftool/dbconfig/20250324-180010-root.json [18:00:38] (03CR) 10BCornwall: [C:03+2] upgrade cp7002 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130661 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:00:44] (03CR) 10BCornwall: [C:03+2] upgrade cp7016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130675 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:00:56] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670349 (10Dzahn) In that case this is the right type of access request. Just that it _should_ already work. Sounds like either we have a bug (then I would refer to infra foun... 
[18:00:59] (03PS2) 10BCornwall: upgrade cp7016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130675 (https://phabricator.wikimedia.org/T378737) [18:01:08] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130675 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:01:23] [54642001-47c0-48bd-8078-a052ffeaa6cf] /wiki/Special:Contributions/***** Wikimedia\RequestTimeout\EmergencyTimeoutException: The critical section "Wikimedia\Rdbms\Database::executeQuery" timed out after 180 seconds [18:01:38] Indeed [18:01:52] I was just loading a contributions page [18:02:05] Odd.. [18:02:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp70[02,16].magru.wmnet} and A:cp [18:02:59] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670363 (10acooper) According to an email that went out logstash access now requires this new process to be followed. Appreciate if this can be handled as a priority as we need... [18:03:01] 06SRE, 10WMF-General-or-Unknown, 07Performance Issue, 07Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10670365 (10Scott_French) A couple of data points: The `Wikimedia\Rdbms\DBUnexpectedError` see... [18:04:08] Oh that's funny. 
I just mentioned that [18:04:12] :O [18:04:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [18:04:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670367 (10phaultfinder) [18:05:28] (03CR) 10Jdlrobson: "Qies" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [18:07:26] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update rrla staging image and env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130579 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [18:07:53] 10ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848 (10phaultfinder) 03NEW [18:08:35] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on P{cp70[02,16].magru.wmnet} and A:cp [18:08:37] ^Expected, cp7002 applied fine and I'm just canceling during a sleep [18:08:47] (03Merged) 10jenkins-bot: ml-services: update rrla staging image and env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130579 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [18:08:49] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7016.magru.wmnet} and A:cp [18:09:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670388 (10phaultfinder) [18:14:26] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7016.magru.wmnet} and A:cp 
[18:15:39] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10670408 (10Scott_French) @AStein-WMF - If possible, could you please confirm whether logging in as Astein allows you to ack a... [18:17:03] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10670430 (10Scott_French) @DSmit-WMF - If possible, could you please confirm whether your membership in `wmf-deployment` has resolved this? Thanks! [18:17:54] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670434 (10Dzahn) p:05Triage→03High Aha! Thanks for adding that. Then we should also fix the Wikitech docs that still claim that wmf grants access to it. [18:20:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670445 (10phaultfinder) [18:20:44] (03PS1) 10Cwhite: prometheus: add recording rules for use by histogram_quantile [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) [18:23:17] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670453 (10Clement_Goubert) `logstash-access` is one of the groups that can be requested by [[ https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access#Using_the_Wiki... 
[18:24:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670485 (10phaultfinder) [18:29:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:31:52] "Our servers are currently under maintenance or experiencing a technical issue [18:31:52] " [18:32:01] "Error: 503, Backend fetch failed at Mon, 24 Mar 2025 18:31:21 GMT" :O [18:32:32] which site? [18:32:37] what cp server [18:32:46] "Request served via cp1108 cp1108," [18:33:03] yeah [18:33:11] something in text is not happy [18:33:27] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670584 (10Scott_French) 05Open→03In progress a:03Scott_French Taking a look now. [18:33:29] I have the Varnish XID as well [18:34:00] share it in DM [18:35:27] (03CR) 10BCornwall: [C:03+1] "This is fine, we just limit the amount of SNIs in a cert - it's not a hard limit, though." [puppet] - 10https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) (owner: 10Reedy) [18:36:21] 06SRE, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10670627 (10BCornwall) Thanks for reporting this. We're slowly trickling in a lot of domains into acme-chief so there are a lot still missing. This was one of them. 
[18:37:00] 06SRE, 10WMF-General-or-Unknown, 07Performance Issue, 07Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10670637 (10Samwalton9-WMF) > Either that, or the duration reported in the access logs is bogus... [18:38:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74360 and previous config saved to /var/cache/conftool/dbconfig/20250324-183850-root.json [18:38:59] (03CR) 10BCornwall: [C:03+2] upgrade cp7003 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130662 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:39:00] (03CR) 10BCornwall: [C:03+2] upgrade cp7015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130674 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:39:15] (03PS2) 10BCornwall: upgrade cp7015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130674 (https://phabricator.wikimedia.org/T378737) [18:39:25] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130674 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:41:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7003.magru.wmnet} and A:cp [18:41:15] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7015.magru.wmnet} and A:cp [18:44:50] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670741 (10Scott_French) Confirmed that neither `mstyles` nor `mmartorana` were members of `logstash-access`. Both are of course members of `wmf` which should have granted them... 
[18:47:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7015.magru.wmnet} and A:cp [18:47:09] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7003.magru.wmnet} and A:cp [18:49:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670785 (10phaultfinder) [18:51:20] (03PS3) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) [18:51:32] (03CR) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [18:53:26] (03CR) 10Nik Gkountas: [C:03+1] AX: Disable automatic translation entrypoints before release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130345 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [18:53:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74361 and previous config saved to /var/cache/conftool/dbconfig/20250324-185355-root.json [18:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10670821 (10phaultfinder) [18:59:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:00:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:01:20] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for mstyles and mmartorana - https://phabricator.wikimedia.org/T389843#10670871 (10Mstyles) 05In progress→03Resolved works now, thank you! 
[19:01:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:01:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:02:42] FIRING: AlertLintProblem: Linting problems found for RdfStreamingUpdaterHighConsumerUpdateLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [19:06:24] 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10670889 (10Scott_French) 05Open→03Declined As proposed by @ssingh in T388662#10641905, I'm going to move this to Declined in favor of an automatic solution in T335478, particularly gi... [19:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74362 and previous config saved to /var/cache/conftool/dbconfig/20250324-190900-root.json [19:13:49] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:15:11] (03PS1) 10Bking: cloudelastic: replace failed master-eligible host [puppet] - 10https://gerrit.wikimedia.org/r/1130701 (https://phabricator.wikimedia.org/T388150) [19:15:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130701 (https://phabricator.wikimedia.org/T388150) (owner: 10Bking) [19:16:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:23] (03CR) 10BCornwall: [C:03+2] upgrade cp7004 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130663 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:17:27] (03CR) 10BCornwall: [C:03+2] upgrade cp7013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130672 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:17:36] (03PS2) 10BCornwall: upgrade cp7013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130672 
(https://phabricator.wikimedia.org/T378737) [19:17:40] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130672 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:17:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10671026 (10RKemper) a:03Jclark-ctr Looks like we'd forgotten to... [19:18:48] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: replace failed master-eligible host [puppet] - 10https://gerrit.wikimedia.org/r/1130701 (https://phabricator.wikimedia.org/T388150) (owner: 10Bking) [19:18:51] (03CR) 10Ryan Kemper: [C:03+2] cloudelastic: replace failed master-eligible host [puppet] - 10https://gerrit.wikimedia.org/r/1130701 (https://phabricator.wikimedia.org/T388150) (owner: 10Bking) [19:19:29] ryankemper: Okay for me to merge that in? 
[19:19:58] brett: absolutely [19:21:08] done [19:21:13] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7004.magru.wmnet} and A:cp [19:21:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7013.magru.wmnet} and A:cp [19:23:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2254 [19:23:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2254 [19:24:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74363 and previous config saved to /var/cache/conftool/dbconfig/20250324-192406-root.json [19:24:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:25:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:26:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2254.codfw.wmnet with OS bookworm [19:26:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671069 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2254.codfw.wmnet with... 
[19:26:50] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7013.magru.wmnet} and A:cp [19:27:27] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7004.magru.wmnet} and A:cp [19:29:55] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2255 [19:30:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2255 [19:30:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:30:35] (03PS1) 10Brouberol: Add airflow.discovery.wmnet to the airflow-main x509 SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130705 [19:31:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:31:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2255.codfw.wmnet with OS bookworm [19:31:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671094 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with... [19:32:27] 06SRE, 10WMF-General-or-Unknown, 07Performance Issue, 07Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10671097 (10Krinkle) >>! In T389734#10670365, @Scott_French wrote: > […] > `reqId:"e81a219b-3ca... 
[19:32:42] RESOLVED: AlertLintProblem: Linting problems found for RdfStreamingUpdaterHighConsumerUpdateLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [19:33:31] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:33:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with OS... [19:34:45] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130705 (owner: 10Brouberol) [19:36:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:43] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2256 [19:36:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2256 [19:37:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:37:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:38:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2254.codfw.wmnet with reason: host reimage [19:39:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74364 and previous config saved to /var/cache/conftool/dbconfig/20250324-193911-root.json [19:39:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2256.codfw.wmnet with OS bookworm 
[19:39:49] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2256.codfw.wmnet with... [19:41:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2254.codfw.wmnet with reason: host reimage [19:45:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10671139 (10phaultfinder) [19:48:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:48:53] (03CR) 10BCornwall: [C:03+2] upgrade cp7005 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130664 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:48:58] (03PS2) 10BCornwall: upgrade cp7012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130671 (https://phabricator.wikimedia.org/T378737) [19:49:01] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2257 [19:49:04] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130671 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:49:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2257 [19:49:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:50:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:50:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of 
Varnish on P{cp7005.magru.wmnet} and A:cp [19:50:21] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7012.magru.wmnet} and A:cp [19:50:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10671173 (10phaultfinder) [19:50:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2257.codfw.wmnet with OS bookworm [19:50:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2257.codfw.wmnet with... [19:51:43] jouncebot: now and next [19:51:43] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [19:51:49] jouncebot: next [19:51:49] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T2000) [19:52:58] Dibs! 
Would love to backport some patches in the upcoming window (releng is doing some experimentation/tire-kicking of deployment tooling) [19:54:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:55:04] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2258 [19:55:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2258 [19:55:54] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7005.magru.wmnet} and A:cp [19:55:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2258.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:56:08] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7012.magru.wmnet} and A:cp [19:56:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2258.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:56:56] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:57:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2256.codfw.wmnet with reason: host reimage [19:57:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:57:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2254.codfw.wmnet with OS bookworm [19:57:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, 
wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2254.codfw.wmnet with OS... [19:59:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2258.codfw.wmnet with OS bookworm [19:59:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671194 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2258.codfw.wmnet with... [19:59:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and thcipriani: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T2000). nyaa~ [20:00:05] tgr, bwang, kimberly_sarabia, and zip: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] Dibs! Would love to backport some patches in the upcoming window (releng is doing some experimentation/tire-kicking of deployment tooling) (said this earlier but buried in backscroll) [20:00:13] oooh a sticker [20:00:14] hello [20:00:40] hi pppery, I appreciate your help with the flow stuff! [20:00:47] You're welcome [20:01:14] cool, so we have a kimberly_sarabia and a zip do we have a bwang or tgr_ ? 
[20:01:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124535 (https://phabricator.wikimedia.org/T375520) (owner: 10Ryan Kemper) [20:01:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2256.codfw.wmnet with reason: host reimage [20:01:55] o/ [20:02:07] Pppery: what's the canonical pronunciation of your handle anyway? peppery? perry? p-p-perry? [20:02:18] p-p-perry is most like it [20:02:24] hey tgr_ we'll get you going first [20:02:30] it's a corruption of my real name of "Perry", but I use no consistent pronunciation myself [20:02:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2257.codfw.wmnet with reason: host reimage [20:02:33] this has come up this week [20:03:05] I probably should change my handle to "Perry", but that SUL name is already in use [20:03:15] rude when other people get the good handles first [20:03:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host relforge1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:04:18] thcipriani: you can deploy them together. Or together with other things. They don't really need testing. 
[20:05:01] Pppery: I'm up late for the deploy, I'll be re-running the script in the morning (UK time) [20:05:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by spiderpig@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130121 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [20:05:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by spiderpig@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [20:05:23] tgr_: cool, we'll do 'em together [20:05:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2257.codfw.wmnet with reason: host reimage [20:06:13] (03Merged) 10jenkins-bot: Enable SUL3 login for all group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130121 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [20:06:16] (03Merged) 10jenkins-bot: Enable SUL3 login for 1% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [20:06:30] !log spiderpig@deploy1003 Started scap sync-world: Backport for [[gerrit:1130121|Enable SUL3 login for all group 1 users (T384153)]], [[gerrit:1130122|Enable SUL3 login for 1% of group 2 users (T384219)]] [20:06:35] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:06:36] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [20:06:38] (Is it correct for the log message to say `approved by spiderpig` without specifying who used spiderpig to approve it?) [20:06:58] ^ noted [20:07:11] And also spiderpig.wikimedia.org when viewed logged out has a flash of content before redirecting to the login page, which seems strange [20:07:24] Pppery: good note! 
On the list (exactly why we're tire-kicking) [20:07:53] and yeah, shows the empty ui and then makes an api request and sends you to auth [20:08:01] also on the list! [20:11:35] !log spiderpig@deploy1003 tgr, spiderpig: Backport for [[gerrit:1130121|Enable SUL3 login for all group 1 users (T384153)]], [[gerrit:1130122|Enable SUL3 login for 1% of group 2 users (T384219)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:40] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:11:41] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [20:11:58] tgr_: nothing to test, correct? [20:14:03] hi, I'm here, sorry, a bit late [20:14:28] hey bwang not late yet, window still going [20:14:52] !log spiderpig@deploy1003 tgr, spiderpig: Continuing with sync [20:14:57] (03CR) 10Bernard Wang: Disable Search AB Test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [20:15:27] (03PS1) 10Reedy: PopulateLocalAndGlobalIds: Don't set lu_local_id if we don't have a mapping... 
[extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130712 (https://phabricator.wikimedia.org/T303590) [20:16:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2258.codfw.wmnet with reason: host reimage [20:16:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:17:17] investigating some bare-metal keyholder issues [20:17:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:17:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2256.codfw.wmnet with OS bookworm [20:17:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671280 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2256.codfw.wmnet with OS... 
[20:18:25] thcipriani: sorry, yeah not needed [20:18:42] cool, we took your silence to mean that :) [20:19:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2258.codfw.wmnet with reason: host reimage [20:20:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:21:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:21:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2257.codfw.wmnet with OS bookworm [20:21:12] as it goes, my change also does not need checking [20:21:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671293 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2257.codfw.wmnet with OS... 
[20:22:52] !log spiderpig@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130121|Enable SUL3 login for all group 1 users (T384153)]], [[gerrit:1130122|Enable SUL3 login for 1% of group 2 users (T384219)]] (duration: 16m 21s) [20:22:57] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:22:58] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [20:23:55] (03CR) 10Jdlrobson: Disable Search AB Test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [20:24:42] !log brennen@deploy1003 Started scap sync-world: sync world after keyholder errors during spiderpig testing [20:24:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:26:21] some hiccups with keyholder under spiderpig, doing a `scap sync-world` to update bare metal boxen before going ahead. 
[20:26:52] (03CR) 10Bernard Wang: Disable Search AB Test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [20:27:02] 10ops-esams, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389874 (10phaultfinder) 03NEW [20:27:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.561s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:27:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10671315 (10Jclark-ctr) 05Open→03Resolved rebooted server and cleared errors cam... 
[20:27:30] !log brennen@deploy1003 Finished scap sync-world: sync world after keyholder errors during spiderpig testing (duration: 02m 47s) [20:28:01] (03PS1) 10Ahmon Dancy: profile::keyholder::server::agents: Add deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1130715 [20:28:40] bwang: you're up next [20:28:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [20:29:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bullseye [20:29:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671326 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host relforge100... [20:29:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bullseye [20:29:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671327 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host relforge100... [20:29:58] (03CR) 10Ahmon Dancy: "This is to ensure that the spiderpig user (which is not a member of the wikidev group) can deploy to bare metal servers." 
[puppet] - 10https://gerrit.wikimedia.org/r/1130715 (owner: 10Ahmon Dancy) [20:30:16] (03Merged) 10jenkins-bot: Disable Search AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130652 (https://phabricator.wikimedia.org/T389399) (owner: 10Bernard Wang) [20:30:28] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1130652|Disable Search AB Test (T389399)]] [20:30:32] T389399: Disable search A/B test - https://phabricator.wikimedia.org/T389399 [20:31:22] Ty! [20:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:32:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.494s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:35:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:35:17] !log brennen@deploy1003 bwang, brennen: Backport for [[gerrit:1130652|Disable Search AB Test (T389399)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:35:32] bwang: let me know when ready to proceed [20:35:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:35:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2258.codfw.wmnet with OS bookworm [20:36:05] 10ops-codfw, 06SRE, 
06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671358 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2258.codfw.wmnet with OS... [20:36:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671359 (10Jhancock.wm) [20:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.494s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:17] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1009.eqiad.wmnet with reason: host reimage [20:40:26] bwang: any testing to do here? 
[20:42:04] zip: going to get yours going through CI [20:42:14] (03CR) 10Brennen Bearnes: [C:03+2] Don't clobber error information for failed Flow creates [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130658 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [20:42:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1009.eqiad.wmnet with reason: host reimage [20:44:18] (03Merged) 10jenkins-bot: Don't clobber error information for failed Flow creates [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130658 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [20:44:58] bwang: are you testing or should we revert for now? [20:45:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1008.eqiad.wmnet with reason: host reimage [20:45:59] Sorry, I just need 1 sec [20:46:10] no worries, thanks for the update [20:46:16] Which test server? [20:46:22] any [20:46:49] Ok looks good! [20:47:23] ty! going ahead [20:47:28] !log brennen@deploy1003 bwang, brennen: Continuing with sync [20:48:16] kimberly_sarabia: you'll be up next once this finishes [20:48:25] Ok thanks! [20:48:29] brennen: copy that [20:48:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1008.eqiad.wmnet with reason: host reimage [20:48:40] Hasn't Zoe's change already merged? 
[20:49:12] sure enough [20:49:37] guess we'll go ahead with that one then kimberly_sarabia's config change last [20:50:28] should be [20:50:58] actually ppperry do you have backport privs, maybe i should have checked in with you first [20:51:09] I don't [20:51:16] I'm just a volunteer with no access to anything [20:51:25] You're far from the first person to think I'm a deployer though [20:52:03] haha [20:52:17] (03CR) 10BCornwall: [C:03+2] upgrade cp7006 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130665 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:52:28] I'm not even that frequent of a backport window requester [20:52:28] (03PS2) 10BCornwall: upgrade cp7014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130673 (https://phabricator.wikimedia.org/T378737) [20:52:32] Pppery: careful, you might wind up with deployment privileges. [20:52:33] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130673 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:54:00] I'm not even sure if I want deployment access [20:54:41] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130652|Disable Search AB Test (T389399)]] (duration: 24m 12s) [20:54:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7006.magru.wmnet} and A:cp [20:54:46] T389399: Disable search A/B test - https://phabricator.wikimedia.org/T389399 [20:54:47] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7014.magru.wmnet} and A:cp [20:55:00] yeah, best not to volunteer so hard that you're just an unpaid employee [20:55:00] ok zip, starting with yours [20:55:34] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1130658|Don't clobber error information for failed Flow creates (T380911)]] [20:55:38] T380911: Run Flow migration script at *Phase 2b* 
wikis - https://phabricator.wikimedia.org/T380911 [20:56:25] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:56:44] > yeah, best not to volunteer so hard that you're just an unpaid employee -- that's the core of our tech volunteers though. [20:56:51] oh [20:56:56] I guess I shouldn't have said that out loud then :D [20:57:00] haha [20:59:32] !log brennen@deploy1003 zoe, brennen: Backport for [[gerrit:1130658|Don't clobber error information for failed Flow creates (T380911)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:59:41] zip: ready for test [20:59:59] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7014.magru.wmnet} and A:cp [21:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T2100). Please do the needful. [21:00:11] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7006.magru.wmnet} and A:cp [21:00:50] Security folks we have another config change in UTC late backport, FYI [21:01:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [21:02:21] brennen: no test needed, it's a script changing [21:02:36] oh hang on no [21:03:25] it's a change to permissions for changing Flow boards, which should be a no-op [21:04:47] cool [21:04:59] I'll check I still can't create one... 
[21:05:17] !log brennen@deploy1003 zoe, brennen: Continuing with sync [21:05:57] The code is covered by tests, at least [21:06:13] And I had to update the test, which was expecting the permission fail with one error, to instead fail with a different error [21:06:40] sorry, absolute space cadet moment there. it should be a harmless change, and as you say, tested [21:06:57] verified that I have not magically gained the ability to create flow pages [21:07:31] Given that it's disabled almost everywhere, with only a few places even allowing sysops to create pages, I am satisfied I've checked the important case [21:08:06] still, my apologies for not thinking about this ahead of time, I was totally in "this fixes the script" mode [21:08:25] anyway, in the morning I'll see what the script does :) [21:08:41] zip: thanks. comfortable with currently ongoing deploy then. :) [21:08:45] yes [21:09:01] worst case I get a sticker I guess [21:09:55] I tested this and confirmed it changes the error message like it should [21:10:12] great! [21:10:27] (03PS1) 10Ahmon Dancy: scap-master-sync: Clean up orphaned php-* directories after rsync [puppet] - 10https://gerrit.wikimedia.org/r/1130723 (https://phabricator.wikimedia.org/T389830) [21:10:49] With this patch, trying to move a Flow board: "The action you have requested is limited to users in the group: Structured Discussions bots." 
[21:11:09] Without this patch it was giving a message about not having the Flow-board permission [21:11:14] I was testing on gomwiki [21:11:54] FYI I'll likely be asleep when you run the script in the morning, but I'll look at the output when I get up if nobody beats me to it [21:12:32] (And I'm already well into "unpaid employee" mode on Flow deprecation stuff, by the way - I wrote the entire script used to export Flow boards pretty much from scratch) [21:12:38] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130658|Don't clobber error information for failed Flow creates (T380911)]] (duration: 17m 04s) [21:12:45] T380911: Run Flow migration script at *Phase 2b* wikis - https://phabricator.wikimedia.org/T380911 [21:13:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [21:13:30] (03PS2) 10Ahmon Dancy: scap-master-sync: Clean up orphaned php-* directories after rsync [puppet] - 10https://gerrit.wikimedia.org/r/1130723 (https://phabricator.wikimedia.org/T389830) [21:13:32] kimberly_sarabia: k, will ping when ready for test [21:14:23] (03Merged) 10jenkins-bot: Deploy donate banner for all wikis except English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [21:14:34] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1130170|Deploy donate banner for all wikis except English Wikipedia (T388438)]] [21:14:39] T388438: Gradual Rollout - Donate Button Deployment - https://phabricator.wikimedia.org/T388438 [21:14:44] I appreciate the help [21:14:52] this is very unfamiliar territory for me [21:15:09] It's unfamiliar for everyone - I had to learn all of its quirks the hard way [21:15:24] and still missed some, like T388687 [21:15:25] T388687: Flow wikitext API 
doesn’t include image that’s present in a topic - https://phabricator.wikimedia.org/T388687 [21:15:30] brennen: got it [21:15:53] (03CR) 10Ahmon Dancy: "Tested in train-dev" [puppet] - 10https://gerrit.wikimedia.org/r/1130723 (https://phabricator.wikimedia.org/T389830) (owner: 10Ahmon Dancy) [21:19:33] !log brennen@deploy1003 ksarabia, brennen: Backport for [[gerrit:1130170|Deploy donate banner for all wikis except English Wikipedia (T388438)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:19:35] kimberly_sarabia: k, ready for test [21:21:01] brennen: LGTM (can see the new donate banner in French wiki mobile sidebar) [21:21:09] cool, going ahead [21:21:10] !log brennen@deploy1003 ksarabia, brennen: Continuing with sync [21:24:43] (03CR) 10BCornwall: [C:03+2] upgrade cp7007 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130666 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:24:53] (03PS2) 10BCornwall: upgrade cp7011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130670 (https://phabricator.wikimedia.org/T378737) [21:24:56] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130670 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:26:16] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10671545 (10RobH) > Thanks for the logs, Rob, > > > Here are the key points from our analysis: > > > > None of the hosts showed temperature alerts for any of the sensors in the last 2 years. As w... 
[21:26:22] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7007.magru.wmnet} and A:cp [21:26:23] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7011.magru.wmnet} and A:cp [21:28:06] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130170|Deploy donate banner for all wikis except English Wikipedia (T388438)]] (duration: 13m 31s) [21:28:10] T388438: Gradual Rollout - Donate Button Deployment - https://phabricator.wikimedia.org/T388438 [21:28:49] !log end of UTC late backport & config window [21:28:51] thanks all. [21:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:23] Thank you! [21:31:56] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7011.magru.wmnet} and A:cp [21:32:02] 10ops-ulsfo, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884 (10phaultfinder) 03NEW [21:32:02] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7007.magru.wmnet} and A:cp [21:35:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2225:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2225 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:37:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.074s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:41:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2225:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2225 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:41:58] (03CR) 10Reedy: [C:03+2] PopulateLocalAndGlobalIds: Don't set lu_local_id if we don't have a mapping... [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130712 (https://phabricator.wikimedia.org/T303590) (owner: 10Reedy) [21:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.074s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:42:48] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:44:22] 06SRE, 10Thumbor: Thumbnail failures on some SVGs - https://phabricator.wikimedia.org/T389060#10671626 (10Scott_French) I am neither a thumbor nor SVG expert by any means, but: Spot checking one of the two examples and correlating with thumbor error logs, I believe `rsvg-convert` is failing with `rendering err... [21:45:54] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10671632 (10RobH) > Hi Rob, > > There are a couple of issues with the NVMe drive in slot 1. 
First, it is not compatible with the R450 system. Second, the drive is older than the system itself and was ad... [21:52:06] (03Merged) 10jenkins-bot: PopulateLocalAndGlobalIds: Don't set lu_local_id if we don't have a mapping... [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130712 (https://phabricator.wikimedia.org/T303590) (owner: 10Reedy) [21:53:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2259 to codfw - jhancock@cumin2002" [21:53:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2259 to codfw - jhancock@cumin2002" [21:53:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:53:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2259 [21:53:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2259 [22:04:54] !log reedy@deploy1003 Synchronized php-1.44.0-wmf.21/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: T303590 (duration: 11m 56s) [22:04:58] T303590: Fix localuser rows with lu_local_id=0 or lu_global_id=0 - https://phabricator.wikimedia.org/T303590 [22:05:05] (03PS1) 10Arlolra: Allow dot in revision title [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) [22:05:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:05:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1009.eqiad.wmnet with OS bullseye [22:05:25] !log jclark@cumin1002 END (FAIL) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:05:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1008.eqiad.wmnet with OS bullseye [22:05:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host relforge1009.eq... [22:05:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host relforge1008.eq... [22:06:50] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:07:23] (03CR) 10Subramanya Sastry: [C:03+1] "We will retry this next week." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [22:07:54] (03CR) 10Arlolra: Enable Parsoid read views for a few wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [22:11:52] (03CR) 10BCornwall: [C:03+2] upgrade cp7008 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130667 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:12:03] (03PS2) 10BCornwall: upgrade cp7010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130669 (https://phabricator.wikimedia.org/T378737) [22:12:06] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp7010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130669 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:12:15] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1119.eqiad.wmnet with OS bullseye [22:12:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671696 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1119... 
[22:13:12] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7008.magru.wmnet} and A:cp [22:13:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7010.magru.wmnet} and A:cp [22:13:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2260 to codfw - jhancock@cumin2002" [22:13:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2260 to codfw - jhancock@cumin2002" [22:13:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:14:44] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2262 [22:14:45] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2260 [22:14:46] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2261 [22:14:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2262 [22:14:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2261 [22:14:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2260 [22:15:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2261 [22:15:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2261 [22:15:20] (03PS2) 10Arlolra: Enable Parsoid read views for a few wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123487 
(https://phabricator.wikimedia.org/T356718) [22:16:34] 06SRE, 10MediaWiki-Core-AuthManager, 10MediaWiki-Debug-Logger, 06MediaWiki-Platform-Team: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887 (10Tgr) 03NEW [22:17:12] 06SRE, 10MediaWiki-Core-AuthManager, 10MediaWiki-Debug-Logger, 06MediaWiki-Platform-Team: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887#10671712 (10Tgr) [22:18:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671713 (10bking) a:03Jclark-ctr [22:18:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10671716 (10phaultfinder) [22:18:39] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7010.magru.wmnet} and A:cp [22:18:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:18:47] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7008.magru.wmnet} and A:cp [22:19:55] 06SRE, 10MediaWiki-Core-AuthManager, 10MediaWiki-Debug-Logger, 06MediaWiki-Platform-Team: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887#10671732 (10Reedy) `wgSoftBlockRanges` suggests so `lang=php // Soft-block private IPs and other shared resources. // Note this has n... 
[22:20:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:23:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2263 to codfw - jhancock@cumin2002" [22:23:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2263 to codfw - jhancock@cumin2002" [22:23:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:23:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2263 [22:23:45] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2264 [22:23:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2263 [22:23:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2264 [22:24:05] !log Deployed security fix for T358689 [22:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2259.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:25:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2260.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:25:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:25:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:25:59] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:26:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:26:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:26:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:26:29] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10671759 (10RobH) > Hi Rob > > Can you please swap NVME in slot 1 with NVME in slot 3 just to verify that there not a slot issue ? > Dell Support, > > We'd have to dispatch a remote hands reques... [22:26:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:27:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:27:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:28:03] (03CR) 10BCornwall: [C:03+2] upgrade cp7009 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130668 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:29:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:30:28] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7009.magru.wmnet} and A:cp [22:31:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:31:56] (03PS1) 10Bking: WIP: wdqs: Add alerts for no lag metrics reported [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) [22:33:12] (03CR) 10CI reject: [V:04-1] WIP: wdqs: Add alerts for no lag metrics reported [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) (owner: 10Bking) [22:35:57] (03PS2) 10Cwhite: prometheus: add recording rules for use by histogram_quantile [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) [22:36:01] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7009.magru.wmnet} and A:cp [22:36:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2259.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:36:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:36:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2260.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:36:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:37:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:37:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:38:16] (03PS1) 10BCornwall: upgrade cp5017 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130731 (https://phabricator.wikimedia.org/T378737) 
[22:38:17] (03PS1) 10BCornwall: upgrade cp5018 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130732 (https://phabricator.wikimedia.org/T378737) [22:38:19] (03PS1) 10BCornwall: upgrade cp5019 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130733 (https://phabricator.wikimedia.org/T378737) [22:38:20] (03PS1) 10BCornwall: upgrade cp5020 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130734 (https://phabricator.wikimedia.org/T378737) [22:38:22] (03PS1) 10BCornwall: upgrade cp5021 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130735 (https://phabricator.wikimedia.org/T378737) [22:38:23] (03PS1) 10BCornwall: upgrade cp5022 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130736 (https://phabricator.wikimedia.org/T378737) [22:38:24] (03PS1) 10BCornwall: upgrade cp5023 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130737 (https://phabricator.wikimedia.org/T378737) [22:38:27] (03PS1) 10BCornwall: upgrade cp5024 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130738 (https://phabricator.wikimedia.org/T378737) [22:38:31] (03PS1) 10BCornwall: upgrade cp5025 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130739 (https://phabricator.wikimedia.org/T378737) [22:38:35] (03PS1) 10BCornwall: upgrade cp5026 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130740 (https://phabricator.wikimedia.org/T378737) [22:38:39] (03PS1) 10BCornwall: upgrade cp5027 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130741 (https://phabricator.wikimedia.org/T378737) [22:38:43] (03PS1) 10BCornwall: upgrade cp5028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130742 (https://phabricator.wikimedia.org/T378737) [22:38:47] (03PS1) 10BCornwall: upgrade cp5029 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130743 (https://phabricator.wikimedia.org/T378737) [22:38:51] (03PS1) 10BCornwall: upgrade cp5030 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1130744 (https://phabricator.wikimedia.org/T378737) [22:38:55] (03PS1) 10BCornwall: upgrade cp5031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130745 (https://phabricator.wikimedia.org/T378737) [22:39:00] (03PS1) 10BCornwall: upgrade cp5032 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130746 (https://phabricator.wikimedia.org/T378737) [22:42:48] (03CR) 10Cwhite: "These are to support queries like:" [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: 10Cwhite) [22:43:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10671829 (10phaultfinder) [22:44:07] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:46:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2259.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2260.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:46:59] !log 
jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:47:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:47:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:50:24] (03PS2) 10Bking: WIP: wdqs: Add alerts for no lag metrics reported [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) [22:52:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2259.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:52:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:52:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2260.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:52:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2261.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:53:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:53:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:55:26] (03PS3) 10Bking: WIP: wdqs: Add alerts for no lag metrics reported [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) [22:55:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1119.mgmt.eqiad.wmnet 
with chassis set policy FORCE_RESTART [22:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250324T2300) [23:00:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2259.codfw.wmnet with OS bookworm [23:00:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2259.codfw.wmnet with... [23:00:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2260.codfw.wmnet with OS bookworm [23:00:51] (03CR) 10CI reject: [V:04-1] WIP: wdqs: Add alerts for no lag metrics reported [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) (owner: 10Bking) [23:00:53] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2260.codfw.wmnet with... 
[23:00:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2261.codfw.wmnet with OS bookworm [23:01:06] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2261.codfw.wmnet with... [23:01:27] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [23:04:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:05:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10671850 (10phaultfinder) [23:10:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10671884 (10VRiley-WMF) Replaced the drives in an-worker1178 and an-worker1179. Will start provisiong them [23:11:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [23:12:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2260.codfw.wmnet with reason: host reimage [23:12:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2261.codfw.wmnet with reason: host reimage [23:13:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1119.eqiad.wmnet with OS bullseye [23:13:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1119... 
[23:14:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:15:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2260.codfw.wmnet with reason: host reimage [23:18:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2261.codfw.wmnet with reason: host reimage [23:23:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671901 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1119.eqi... 
[23:27:21] (03CR) 10Cwhite: [C:03+2] add statsv throughput alerts [alerts] - 10https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) (owner: 10Cwhite) [23:28:40] (03Merged) 10jenkins-bot: add statsv throughput alerts [alerts] - 10https://gerrit.wikimedia.org/r/1129899 (https://phabricator.wikimedia.org/T389469) (owner: 10Cwhite) [23:30:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:36:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:38:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:38:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2261.codfw.wmnet with OS bookworm [23:38:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:38:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2260.codfw.wmnet with OS bookworm [23:38:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671944 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2261.codfw.wmnet with OS... 
[23:38:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2260.codfw.wmnet with OS... [23:41:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10671947 (10Jhancock.wm) [23:48:22] (03CR) 10Ssingh: [C:03+1] upgrade cp5017 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130731 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:20] (03CR) 10Ssingh: [C:03+1] upgrade cp5018 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130732 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:21] (03CR) 10Ssingh: [C:03+1] upgrade cp5019 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130733 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:22] (03CR) 10Ssingh: [C:03+1] upgrade cp5020 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130734 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:23] (03CR) 10Ssingh: [C:03+1] upgrade cp5021 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130735 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:24] (03CR) 10Ssingh: [C:03+1] upgrade cp5022 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130736 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:25] (03CR) 10Ssingh: [C:03+1] upgrade cp5023 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130737 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:27] (03CR) 10Ssingh: [C:03+1] upgrade cp5024 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130738 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:31] 
(03CR) 10Ssingh: [C:03+1] upgrade cp5025 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130739 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:35] (03CR) 10Ssingh: [C:03+1] upgrade cp5026 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130740 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:39] (03CR) 10Ssingh: [C:03+1] upgrade cp5027 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130741 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:43] (03CR) 10Ssingh: [C:03+1] upgrade cp5028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130742 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:47] (03CR) 10Ssingh: [C:03+1] upgrade cp5029 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130743 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:51] (03CR) 10Ssingh: [C:03+1] upgrade cp5030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130744 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:49:55] (03CR) 10Ssingh: [C:03+1] upgrade cp5031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130745 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:50:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10671997 (10Jclark-ctr) [23:51:34] (03PS1) 10Gergő Tisza: Enable SUL3 login for 10% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130752 (https://phabricator.wikimedia.org/T384219) [23:53:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130752 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) 
[23:54:08] 06SRE, 10Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875#10672024 (10AntiCompositeNumber)