[00:04:16] FIRING: SystemdUnitFailed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_CA.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:41] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:12:24] !log zabe@mwmaint2002:~$ cat /srv/mediawiki-staging/dblists/group1.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/text_table_cleanup/{} --sleep 0.3" # T183490
[00:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:27] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:38:33] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118881
[00:38:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118881 (owner: TrainBranchBot)
[00:40:47] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:48:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118881 (owner: TrainBranchBot)
[00:51:01] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.48 ms
[01:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118882
[01:08:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118882 (owner:
TrainBranchBot)
[01:29:44] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118882 (owner: TrainBranchBot)
[01:37:37] (PS1) Zabe: beta: Drop old wgGlobalBlockingAllowGlobalAccountBlocks flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1118886 (https://phabricator.wikimedia.org/T386132)
[01:38:35] (CR) Zabe: [C:+2] beta: Drop old wgGlobalBlockingAllowGlobalAccountBlocks flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1118886 (https://phabricator.wikimedia.org/T386132) (owner: Zabe)
[01:39:01] (CR) TrainBranchBot: [C:+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118886 (https://phabricator.wikimedia.org/T386132) (owner: Zabe)
[01:39:17] (Merged) jenkins-bot: beta: Drop old wgGlobalBlockingAllowGlobalAccountBlocks flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1118886 (https://phabricator.wikimedia.org/T386132) (owner: Zabe)
[01:40:41] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (203726s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:49:08] (PS1) Zabe: MCR Stage 4: Reduce dewiktionary revision-slots cache expiry [mediawiki-config] - https://gerrit.wikimedia.org/r/1118888
[01:49:16] (CR) Zabe: [C:+2] MCR Stage 4: Reduce dewiktionary revision-slots cache expiry [mediawiki-config] - https://gerrit.wikimedia.org/r/1118888 (owner: Zabe)
[01:49:54] (Merged) jenkins-bot: MCR Stage 4: Reduce dewiktionary revision-slots cache expiry [mediawiki-config] -
https://gerrit.wikimedia.org/r/1118888 (owner: Zabe)
[01:50:56] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1118888|MCR Stage 4: Reduce dewiktionary revision-slots cache expiry]]
[01:54:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[01:55:31] !log zabe@deploy2002 zabe: Backport for [[gerrit:1118888|MCR Stage 4: Reduce dewiktionary revision-slots cache expiry]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:55:47] !log zabe@deploy2002 zabe: Continuing with sync
[01:59:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[02:02:43] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118888|MCR Stage 4: Reduce dewiktionary revision-slots cache expiry]] (duration: 11m 46s)
[02:11:41] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:13:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.189s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:18:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.189s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:21] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:02:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:12:10] (PS1) Arlolra: Bust cache for recreated pages [deployment-charts] - https://gerrit.wikimedia.org/r/1118890
[03:50:53] SRE, Infrastructure-Foundations: Use FIDO2 ssh keys for production access - https://phabricator.wikimedia.org/T385229#10542574 (cmooney) Open→Resolved This is merged and all works as expected. I can confirm Homer also works ok directly from my laptop, via our bastions. When connecting it co...
[04:07:41] FIRING: SystemdUnitFailed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_CA.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:16] FIRING: [2x] SystemdUnitFailed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_CA.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:40:41] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (126387 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T0700)
[08:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T0800)
[08:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:08] (PS1) Brouberol: airflow-analytics: specify the path to the SSL CA certificate bundle [puppet] - https://gerrit.wikimedia.org/r/1119043 (https://phabricator.wikimedia.org/T386092)
[08:02:07] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4936/co" [puppet] - https://gerrit.wikimedia.org/r/1119043 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[08:02:48] o/
[08:03:05] I will deploy with gmodena
[08:03:20] o/
[08:04:09] (CR) Brouberol: [C:+1] dse-k8s: Stop installing the amd rocm packages to dse-k8s-worker1001 [puppet] - https://gerrit.wikimedia.org/r/1118846 (https://phabricator.wikimedia.org/T377875) (owner: Btullis)
[08:05:16] (CR) DCausse: [C:+1] cirrus: create buckets for mlr 2025 experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:06:42] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:06:42] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:07:28] (Merged) jenkins-bot: cirrus: deploy new mlr models [mediawiki-config] - https://gerrit.wikimedia.org/r/1118782 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:07:30] (Merged) jenkins-bot: cirrus: create buckets for mlr 2025 experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1118783 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:08:18] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1118783|cirrus: create buckets for mlr 2025 experiment (T385972)]],
[[gerrit:1118782|cirrus: deploy new mlr models (T385972)]]
[08:08:21] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[08:11:21] !log dcausse@deploy2002 dcausse, gmodena: Backport for [[gerrit:1118783|cirrus: create buckets for mlr 2025 experiment (T385972)]], [[gerrit:1118782|cirrus: deploy new mlr models (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:13:03] SRE, observability, Wikimedia-Logstash, Epic: [Epic] Migrate log transport to kafka for Search Platform applications - https://phabricator.wikimedia.org/T224911#10542693 (Gehel)
[08:13:35] SRE, Elasticsearch, Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#10542711 (Gehel)
[08:15:25] RECOVERY - Categories update lag on wdqs2022 is OK: OK - Categories lag: 3:15:24.357554 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[08:17:33] (PS1) Gmodena: cirrus: update ltr model on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119045 (https://phabricator.wikimedia.org/T385972)
[08:18:41] !log dcausse@deploy2002 dcausse, gmodena: Continuing with sync
[08:18:53] (CR) DCausse: [C:+1] cirrus: update ltr model on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119045 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:23:44] we'll have a quick followup to deploy after this one (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1119045)
[08:25:22] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118783|cirrus: create buckets for mlr 2025 experiment (T385972)]], [[gerrit:1118782|cirrus: deploy new mlr models (T385972)]] (duration: 17m 03s)
[08:25:25] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[08:26:32] (PS1) Gehel: alertmanager: send Search Platform alerts to main phab
board [puppet] - https://gerrit.wikimedia.org/r/1119048
[08:27:43] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119045 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:28:27] (Merged) jenkins-bot: cirrus: update ltr model on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119045 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:28:57] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1119045|cirrus: update ltr model on enwiki (T385972)]]
[08:31:58] !log dcausse@deploy2002 gmodena, dcausse: Backport for [[gerrit:1119045|cirrus: update ltr model on enwiki (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:32:01] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[08:34:22] SRE, Elasticsearch: Have dedicated master nodes for elasticsearch - https://phabricator.wikimedia.org/T130590#10542879 (Gehel)
[08:35:21] SRE, Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#10542916 (Gehel)
[08:35:33] !log dcausse@deploy2002 gmodena, dcausse: Continuing with sync
[08:39:25] SRE, collaboration-services, serviceops, Patch-For-Review, Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296#10543018 (Gehel)
[08:40:52] SRE, CirrusSearch, envoy, Infrastructure-Foundations, and 4 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291#10543055 (Gehel)
[08:42:07] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119045|cirrus: update ltr model on enwiki (T385972)]] (duration: 13m 10s)
[08:42:10] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[08:42:17]
(CR) DCausse: [C:+1] alertmanager: send Search Platform alerts to main phab board [puppet] - https://gerrit.wikimedia.org/r/1119048 (owner: Gehel)
[08:49:30] (PS1) KartikMistry: Update cxserver to 2025-02-12-075258-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119052 (https://phabricator.wikimedia.org/T381943)
[08:49:35] !log closing the UTC morning backport window
[08:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: Dragoniez)
[08:57:32] (CR) Gehel: [C:+2] alertmanager: send Search Platform alerts to main phab board [puppet] - https://gerrit.wikimedia.org/r/1119048 (owner: Gehel)
[09:00:04] andre and jnuche: That opportune time for a MediaWiki train - Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T0900).
[09:03:09] (PS1) Brouberol: ES/rolling-operation: enforce --nodes-per-run=1 on relforge [cookbooks] - https://gerrit.wikimedia.org/r/1119056 (https://phabricator.wikimedia.org/T380752)
[09:07:41] FIRING: [2x] SystemdUnitFailed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_CA.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:23] (PS1) Brouberol: ES/rolling-operation: add a optional flag to ask for confirmation before running operation [cookbooks] - https://gerrit.wikimedia.org/r/1119058 (https://phabricator.wikimedia.org/T380752)
[09:14:36] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119059 (https://phabricator.wikimedia.org/T382367)
[09:14:37] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119059 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:15:24] (Merged) jenkins-bot: group1 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119059 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:15:39] (PS2) Brouberol: ES/rolling-operation: add a optional flag to ask for confirmation before running operation [cookbooks] - https://gerrit.wikimedia.org/r/1119058 (https://phabricator.wikimedia.org/T380752)
[09:16:20] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test rolling-operation cookbook - brouberol@cumin2002
[09:16:33] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test rolling-operation cookbook - brouberol@cumin2002
[09:18:27] (PS3) Brouberol: ES/rolling-operation: add a optional flag to ask for confirmation before running operation [cookbooks] - https://gerrit.wikimedia.org/r/1119058 (https://phabricator.wikimedia.org/T380752)
[09:18:32] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test rolling-operation cookbook - brouberol@cumin2002
[09:18:43] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test rolling-operation cookbook - brouberol@cumin2002
[09:24:10] (CR) CI reject: [V:-1] ES/rolling-operation: add a optional flag to ask for confirmation before running operation [cookbooks] - https://gerrit.wikimedia.org/r/1119058 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:24:34] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.16 refs T382367
[09:24:38] T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367
[09:27:18] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test rolling-operation cookbook - brouberol@cumin2002
[09:27:22] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test rolling-operation cookbook - brouberol@cumin2002
[09:38:58] (CR) Dragoniez: "I'm ready to join #wikimedia-operations and have the WikimediaDebug extension installed.
Once the backport window opens, will I simply nee" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: Dragoniez)
[09:39:32] (CR) Klausman: [C:+1] "I _think_ a Bookworm host without the rocm line will be monitored for GPU stuff just fine, e.g. ml-staging2003 works that way." [puppet] - https://gerrit.wikimedia.org/r/1118846 (https://phabricator.wikimedia.org/T377875) (owner: Btullis)
[09:44:31] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:09] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100%
[10:20:06] (PS1) TrainBranchBot: group0 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119074 (https://phabricator.wikimedia.org/T382367)
[10:20:08] (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119074 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[10:21:16] (Merged) jenkins-bot: group0 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119074 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[10:22:16] (CR) Btullis: [C:+2] dse-k8s: Use partman recipes for containerd with local storage support [puppet] - https://gerrit.wikimedia.org/r/1118844 (https://phabricator.wikimedia.org/T377875) (owner: Btullis)
[10:22:26] (CR) Btullis: [C:+2] dse-k8s: Stop installing the amd rocm packages to dse-k8s-worker1001 [puppet] - https://gerrit.wikimedia.org/r/1118846 (https://phabricator.wikimedia.org/T377875) (owner: Btullis)
[10:30:29] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.16 refs T382367
[10:30:33] T382367: 1.44.0-wmf.16 deployment blockers
- https://phabricator.wikimedia.org/T382367
[10:34:31] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1100)
[11:14:49] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[11:31:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118541 (owner: Sergio Gimeno)
[11:32:19] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:50:59] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:54:57] (PS1) Stevemunene: update dsek8s cluster to use containerd [puppet] -
https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875)
[11:55:18] (CR) CI reject: [V:-1] update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[11:59:09] (PS10) Brouberol: mediawiki: Add support for dumps suspended job [deployment-charts] - https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: Giuseppe Lavagetto)
[12:00:05] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1200).
[12:01:46] (PS11) Brouberol: mediwiki-dumps-legacy: Create helmfile deployment of a suspended job [deployment-charts] - https://gerrit.wikimedia.org/r/1114001 (https://phabricator.wikimedia.org/T352650) (owner: Btullis)
[12:03:45] (PS1) Mvolz: Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/1119088
[12:04:13] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1118492 (owner: PipelineBot)
[12:05:36] (PS2) Stevemunene: update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875)
[12:05:36] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1118492 (owner: PipelineBot)
[12:05:57] (CR) CI reject: [V:-1] update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:07:46] (CR) Btullis: update dsek8s cluster to use containerd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:08:48] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid:
apply
[12:09:15] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:11:23] (PS3) Stevemunene: update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875)
[12:11:38] (CR) Btullis: update dsek8s cluster to use containerd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:11:42] (CR) CI reject: [V:-1] update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:11:57] (CR) Btullis: update dsek8s cluster to use containerd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:13:00] (PS4) Stevemunene: update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875)
[12:13:15] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:13:41] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:13:57] (CR) Stevemunene: update dsek8s cluster to use containerd (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:14:24] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:14:53] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:17:02] (PS5) Stevemunene: update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875)
[12:19:21] (CR) Mvolz: [C:+2] Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/1119088 (owner: Mvolz)
[12:19:32] (CR) Btullis: [C:+1] "Looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:20:46] (Merged) jenkins-bot: Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/1119088 (owner: Mvolz)
[12:22:37] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply
[12:23:20] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[12:25:59] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply
[12:26:27] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[12:27:20] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[12:27:58] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[12:32:23] (CR) Stevemunene: [C:+2] update dsek8s cluster to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119087 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:32:45] (CR) Michael Große: "Moving this forward would be great, so that `$__rate_interval` works as expected. I don't have strong opinions regarding setting it to 30s" [puppet] - https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) (owner: Filippo Giunchedi)
[12:35:37] (PS1) Cory Massaro: wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417.
[deployment-charts] - https://gerrit.wikimedia.org/r/1119094 (https://phabricator.wikimedia.org/T379977)
[12:40:28] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[12:44:36] (PS1) Máté Szabó: Use original connection handle in onTransactionPreCommitOrIdle() [extensions/CheckUser] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119099 (https://phabricator.wikimedia.org/T386171)
[12:45:03] jouncebot: nowandnext
[12:45:03] For the next 0 hour(s) and 14 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1200)
[12:45:03] In 1 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1400)
[12:47:41] (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119099 (https://phabricator.wikimedia.org/T386171) (owner: Máté Szabó)
[12:50:45] (PS1) Stevemunene: Change dse-k8s-worker1002 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119102 (https://phabricator.wikimedia.org/T377875)
[12:50:47] (PS1) Stevemunene: Change dse-k8s-worker1003 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119103 (https://phabricator.wikimedia.org/T377875)
[12:50:48] (PS1) Stevemunene: Change dse-k8s-worker1004 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119104 (https://phabricator.wikimedia.org/T377875)
[12:50:50] (PS1) Stevemunene: Change dse-k8s-worker1009 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119105 (https://phabricator.wikimedia.org/T377875)
[12:50:51] (PS1) Stevemunene: Remove docker related referrences on dse-k8s worker and master [puppet] - https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875)
[12:55:20] (PS1) Cory Massaro: wikifunctions:
Upgrade evaluators from 2025-01-30-011236 to 2025-02-11-155338. [deployment-charts] - https://gerrit.wikimedia.org/r/1119109 (https://phabricator.wikimedia.org/T383631)
[12:57:58] (Merged) jenkins-bot: Use original connection handle in onTransactionPreCommitOrIdle() [extensions/CheckUser] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119099 (https://phabricator.wikimedia.org/T386171) (owner: Máté Szabó)
[12:58:03] (PS2) Jforrester: wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417. [deployment-charts] - https://gerrit.wikimedia.org/r/1119094 (https://phabricator.wikimedia.org/T379977) (owner: Cory Massaro)
[12:58:30] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1119099|Use original connection handle in onTransactionPreCommitOrIdle() (T386171)]]
[12:58:31] (CR) Jforrester: [C:+1] wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417. [deployment-charts] - https://gerrit.wikimedia.org/r/1119094 (https://phabricator.wikimedia.org/T379977) (owner: Cory Massaro)
[12:58:33] T386171: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'centralauth.logging' doesn't exist - https://phabricator.wikimedia.org/T386171
[12:58:39] (PS2) Jforrester: wikifunctions: Upgrade evaluators from 2025-01-30-011236 to 2025-02-11-155338. [deployment-charts] - https://gerrit.wikimedia.org/r/1119109 (https://phabricator.wikimedia.org/T383631) (owner: Cory Massaro)
[12:58:42] (CR) Jforrester: [C:+1] wikifunctions: Upgrade evaluators from 2025-01-30-011236 to 2025-02-11-155338.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1119109 (https://phabricator.wikimedia.org/T383631) (owner: 10Cory Massaro) [13:01:33] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1119099|Use original connection handle in onTransactionPreCommitOrIdle() (T386171)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:02:54] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:03:22] !log mszabo@deploy2002 mszabo: Continuing with sync [13:04:36] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [13:06:44] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [13:06:58] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [13:07:05] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [13:08:04] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [13:08:12] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [13:09:21] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [13:09:58] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119099|Use original connection handle in onTransactionPreCommitOrIdle() (T386171)]] (duration: 11m 27s) [13:10:01] T386171: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'centralauth.logging' doesn't exist - https://phabricator.wikimedia.org/T386171 [13:14:00] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [13:14:34] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [13:16:17] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [13:17:08] !log 
tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [13:18:58] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [13:19:43] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [13:20:23] I'm going to start promoting group1 wikis to 1.44.0-wmf.16 again (as I had to roll back earlier and the blocker fix just got backported and deployed) [13:25:18] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [13:25:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 396MiB (2% inode=33%): /tmp 396MiB (2% inode=33%): /var/tmp 396MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [13:29:00] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119112 (https://phabricator.wikimedia.org/T382367) [13:29:01] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119112 (https://phabricator.wikimedia.org/T382367) (owner: 10TrainBranchBot) [13:29:54] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119112 (https://phabricator.wikimedia.org/T382367) (owner: 10TrainBranchBot) [13:31:59] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [13:40:44] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.16 refs T382367 [13:40:47] T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367 [13:45:21] RECOVERY - Disk space on grafana2001 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [13:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:21] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [13:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:17] (03PS1) 10Michael Große: refactor(AddLink): Make eval steps more legible [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 [13:53:35] (03PS1) 10Michael Große: feat(AddLink): store null if there is no recommendation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) [13:54:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [13:55:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 
(https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [13:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1400). [14:00:05] codders, Dragoniez, sergi0, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:33] o/ [14:00:38] o/ [14:00:55] o/ [14:01:11] o/ [14:01:29] o/ [14:02:01] the band's back together - Hi Michael! :) [14:02:25] Hey there 👋 [14:02:36] I can deploy today ^^ [14:02:56] oof, that’s a lot of bad blobs in logspam-watch [14:03:41] I think there was something mentioned about bad blobs in that train-blocker email, right? 
[14:03:53] yeah, it’s that task [14:05:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118484 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:06:05] (03Merged) 10jenkins-bot: Enable fixed Wikibase RDF on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118484 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:06:06] (03PS1) 10FNegri: kernel-messages: add category=keyword_error [puppet] - 10https://gerrit.wikimedia.org/r/1119123 (https://phabricator.wikimedia.org/T386083) [14:06:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1118484|Enable fixed Wikibase RDF on Beta (T384344)]] [14:06:38] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [14:06:56] ah, it does a full deployment because I touched other files, even though it’s supposed to be beta-only :S [14:06:59] I thought this one would be faster [14:06:59] ok [14:07:01] (03CR) 10Bking: [C:03+1] ES/rolling-operation: enforce --nodes-per-run=1 on relforge [cookbooks] - 10https://gerrit.wikimedia.org/r/1119056 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [14:07:42] (03PS2) 10FNegri: kernel-messages: add category=keyword_error [puppet] - 10https://gerrit.wikimedia.org/r/1119123 (https://phabricator.wikimedia.org/T386083) [14:07:49] (03CR) 10CI reject: [V:04-1] refactor(AddLink): Make eval steps more legible [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [14:08:28] (03CR) 10Lucas Werkmeister (WMDE): "There’s nothing else you need to do before :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: 10Dragoniez) [14:08:59] (b) thanks [14:09:20] 
(03CR) 10Michael Große: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [14:09:35] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1118484|Enable fixed Wikibase RDF on Beta (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:04] codders: anything to test for this change? [14:10:05] (03CR) 10CI reject: [V:04-1] feat(AddLink): store null if there is no recommendation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:10:10] don't think so [14:10:16] Hey folks! I see there are quite a few patches scheduled for deployment, but would it be possible to add a last-minute patch if there's enough time? (I still need to write the patch also) [14:10:21] I was just going to check that beta wasn't completely broken once its pushed [14:10:28] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:10:42] I checked that https://www.wikidata.org/wiki/Special:EntityData/Q53569537.ttl?flavor=dump shows no change (i.e. the change isn’t live in production) [14:10:54] though I guess that was moot since the code isn’t even in this week’s train yet, let alone last week’s :D [14:11:00] :) [14:11:01] Daimona: you can try ^^ [14:11:10] (03CR) 10Michael Große: "recheck flaky QUnit test as well (see T386015)" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:12:17] Daimona: backport or config change? 
[14:12:36] config change for T376822 [14:12:38] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:12:41] ok [14:17:10] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118484|Enable fixed Wikibase RDF on Beta (T384344)]] (duration: 10m 35s) [14:17:14] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [14:17:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: 10Dragoniez) [14:17:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118541 (owner: 10Sergio Gimeno) [14:18:13] (03Merged) 10jenkins-bot: viwiki: Restrict the "changetags" permission to the sysop and bot groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: 10Dragoniez) [14:18:15] (03Merged) 10jenkins-bot: beta: fix typo in GEApiQueryGrowthTasksLookaheadSize variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118541 (owner: 10Sergio Gimeno) [14:18:32] MichaelG_WMF: how bad is it if only one of the two backports gets merged? [14:18:44] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1118497|viwiki: Restrict the "changetags" permission to the sysop and bot groups (T385960)]], [[gerrit:1118541|beta: fix typo in GEApiQueryGrowthTasksLookaheadSize variable]] [14:18:47] T385960: Restrict "changetags" userright to sysops and bots on Vietnamese Wikipedia - https://phabricator.wikimedia.org/T385960 [14:18:55] since the CI is looking so flaky :S [14:19:01] not worth it. 
I need the second one, the first one was just a refactoring [14:19:11] yeah, see the flakyness, we can post-pone [14:19:24] (03CR) 10Ottomata: [C:03+1] airflow-analytics: specify the path to the SSL CA certificate bundle [puppet] - 10https://gerrit.wikimedia.org/r/1119043 (https://phabricator.wikimedia.org/T386092) (owner: 10Brouberol) [14:19:30] well, we can still try it [14:19:35] especially that QUnit thing is really annoying - are you seeing that too for Wikibase? [14:19:41] just wondering what to do if the first change makes it through gate-and-submit and the second one doesn’t [14:19:51] I think I saw it once not long ago [14:19:53] but not regularly [14:20:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment, let’s see if it has better luck than the test build" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [14:20:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/next (k8s) 1.145s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:20:19] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment, let’s see if it has better luck than the test build" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:20:42] Nothing bad happens if only the first one makes it through, that is just a no-op refactoring [14:20:57] then I have one change less to worry about in the next window [14:21:02] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow-analytics: specify the path to the SSL CA certificate bundle [puppet] - 
10https://gerrit.wikimedia.org/r/1119043 (https://phabricator.wikimedia.org/T386092) (owner: 10Brouberol) [14:21:07] ok [14:21:44] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dragoniez, sgimeno: Backport for [[gerrit:1118497|viwiki: Restrict the "changetags" permission to the sysop and bot groups (T385960)]], [[gerrit:1118541|beta: fix typo in GEApiQueryGrowthTasksLookaheadSize variable]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:21:53] Dragoniez: please test using WikimediaDebug :) [14:22:46] e.g. https://vi.wikipedia.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng should look different [14:22:52] Everything looks good :) [14:22:55] \o/ [14:22:57] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dragoniez, sgimeno: Continuing with sync [14:23:09] (nothing to test for sergi0 since it’s beta-only ^^) [14:23:20] nope, ty @Lucas_WMDE ! [14:24:13] (03CR) 10CI reject: [V:04-1] feat(AddLink): store null if there is no recommendation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:25:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/next (k8s) 1.145s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:25:17] (03CR) 10Andrew Bogott: [C:03+1] kernel-messages: add category=keyword_error [puppet] - 10https://gerrit.wikimedia.org/r/1119123 (https://phabricator.wikimedia.org/T386083) (owner: 10FNegri) [14:28:55] (03CR) 10Brouberol: [C:03+2] ES/rolling-operation: enforce --nodes-per-run=1 on relforge [cookbooks] - 10https://gerrit.wikimedia.org/r/1119056 
(https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [14:29:44] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118497|viwiki: Restrict the "changetags" permission to the sysop and bot groups (T385960)]], [[gerrit:1118541|beta: fix typo in GEApiQueryGrowthTasksLookaheadSize variable]] (duration: 10m 59s) [14:29:47] T385960: Restrict "changetags" userright to sysops and bots on Vietnamese Wikipedia - https://phabricator.wikimedia.org/T385960 [14:30:32] well, let’s try those backports [14:30:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [14:30:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:30:51] the builds haven’t failed yet, at least [14:31:14] (03CR) 10FNegri: [C:03+2] kernel-messages: add category=keyword_error [puppet] - 10https://gerrit.wikimedia.org/r/1119123 (https://phabricator.wikimedia.org/T386083) (owner: 10FNegri) [14:31:23] 🤞 [14:32:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [14:32:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:32:49] wtf scap [14:32:54] “The change '1119116' failed build tests and could not be merged” [14:33:16] 🤨 [14:33:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119115 (owner: 10Michael Große) [14:33:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:34:09] managed to fix it [14:34:17] (by removing the V-1 from jenkins-bot manually) [14:35:49] Lucas_WMDE: there will not be really anything to test for these changes. They modify how a Maintenance script works. So, if the error logs look clear, then we should be good to go [14:35:58] ok [14:36:10] (03Merged) 10jenkins-bot: feat(AddLink): store null if there is no recommendation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119116 (https://phabricator.wikimedia.org/T382270) (owner: 10Michael Große) [14:36:17] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/next (k8s) 1.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:36:36] yay [14:36:41] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1119115|refactor(AddLink): Make eval steps more legible]], [[gerrit:1119116|feat(AddLink): store null if there is no recommendation (T382270)]] [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:43] 🙌 [14:36:44] T382270: Store the 
fact that Add Link did not generate any recommendation for a page, don't try again - https://phabricator.wikimedia.org/T382270 [14:39:41] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1119115|refactor(AddLink): Make eval steps more legible]], [[gerrit:1119116|feat(AddLink): store null if there is no recommendation (T382270)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:39:47] (03PS2) 10FNegri: wmcs: kernel_errors: don't alert on warning messages [alerts] - 10https://gerrit.wikimedia.org/r/1118547 (owner: 10Arturo Borrero Gonzalez) [14:41:04] nothing special in mwdebug logstash so far [14:41:17] hm, “MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find BabelCentralDb in community configuration, returning configuration from the fallback config” [14:41:17] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/next (k8s) 1.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:41:27] (that’s an INFO message) [14:41:31] hopefully harmless [14:41:39] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, migr: Continuing with sync [14:41:41] and unrelated, but good to know [14:41:42] let’s continue [14:41:49] (03CR) 10FNegri: "I fixed the tests, and also included the new "keyword_error" category" [alerts] - 10https://gerrit.wikimedia.org/r/1118547 (owner: 10Arturo Borrero Gonzalez) [14:42:23] 👍 [14:42:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1094:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1094 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:42:45] (03PS29) 10Bking: Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:44:44] Daimona: are you still working on that config change? [14:45:31] I'm working on a seemingly endless supply of bugs in MW, the so called friends I made along the way... [14:45:36] :( [14:45:39] But I can make my config change now [14:45:52] just checking if it should still go into this window [14:45:55] I think we might just have time for it [14:46:00] but no pressure [14:46:23] can also happen later if that’s better [14:47:01] That'd be nice, yep. I'm mostly trying to wrap my head around merge strategies for user right config settings, and trying to understand why it's working in the first place when it seems like it's obviously wrong. [14:47:35] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [14:47:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1094:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1094 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:48:29] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119115|refactor(AddLink): Make eval steps more legible]], [[gerrit:1119116|feat(AddLink): store null if there is no recommendation (T382270)]] (duration: 11m 47s) [14:48:32] T382270: Store the fact that Add Link did not generate any recommendation 
for a page, don't try again - https://phabricator.wikimedia.org/T382270 [14:48:35] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [14:50:07] (03PS1) 10Daimona Eaytoy: Let sysop add/remove the event-organizer group by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119136 (https://phabricator.wikimedia.org/T376822) [14:50:36] (03PS2) 10Daimona Eaytoy: Let sysops add/remove the event-organizer group by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119136 (https://phabricator.wikimedia.org/T376822) [14:51:13] ^^Made the config change. LMK if there's time for it [14:51:20] jouncebot: nowandnext [14:51:20] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1400) [14:51:21] In 0 hour(s) and 8 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1500) [14:51:31] bit tight :/ [14:51:40] James_F: is it okay if we run over a bit into the wikifunctions window? [14:52:16] we could also do it after then, I’d still be around until 17:00 UTC or so, not sure about you Daimona [14:53:00] Yup, I'll be around the whole afternoon our time [14:53:12] ok, let’s try that then [14:53:23] * MichaelG_WMF is signing off for now, Thank you for the deployment and support! [14:53:31] wait for the wikifunctions window to start and then see when that stops being active :) [14:54:11] OK, thanks. Feel free to ping me when ready. I'll use this time to finish writing the related bug report. 
[14:54:31] FIRING: [2x] SystemdUnitFailed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_CA.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:43] 👍 [14:57:19] RESOLVED: [2x] SystemdUnitFailed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_CA.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [15:00:02] (03CR) 10Btullis: [C:03+1] Change dse-k8s-worker1002 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119102 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1500) [15:00:14] (03CR) 10Btullis: [C:03+1] Change dse-k8s-worker1003 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119103 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [15:00:29] (03CR) 10Btullis: [C:03+1] Change dse-k8s-worker1004 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119104 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [15:00:37] (03CR) 10Btullis: [C:03+1] Change dse-k8s-worker1009 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119105 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [15:01:39] Lucas_WMDE: Sure, we're in parallel anyway. 
[15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:52] alright, then I’ll start that now (cc Daimona) [15:01:53] thanks! [15:01:56] (03CR) 10Btullis: "+1 in principle, but I'll wait until the cluster has been completely reimaged and containerd is in use everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [15:02:14] Okay [15:02:29] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119094 (https://phabricator.wikimedia.org/T379977) (owner: 10Cory Massaro) [15:02:35] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Matches what we do for bureaucrats in `$wmgUseTranslate` and `$wmgEnablePageTriage`, so let’s try it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119136 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [15:02:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119136 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [15:03:34] (03Merged) 10jenkins-bot: Let sysops add/remove the event-organizer group by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119136 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [15:03:41] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1119094 (https://phabricator.wikimedia.org/T379977) (owner: 10Cory Massaro) [15:04:04] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1119136|Let sysops add/remove the event-organizer group by default (T376822)]] [15:04:08] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [15:04:26] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:04:31] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:02] I'm so grateful to the "Unassigned" team for maintaining so many of our critical components. It's a shame I can't tag them on my task. [15:05:15] They're mysterious like that. [15:05:24] Which component is your task about? [15:06:59] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Backport for [[gerrit:1119136|Let sysops add/remove the event-organizer group by default (T376822)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:07:01] MW configuration T386210 [15:07:03] T386210: Setting wgAddGroups and wgRemoveGroups in extension.json is not supported - https://phabricator.wikimedia.org/T386210 [15:07:56] Daimona: Ah, fun. 
[15:07:59] Daimona: looks good to me so far [15:08:17] (I’m checking siprop=usergroups API output with/without the header) [15:08:47] (03PS1) 10Marostegui: control-mariadb-10.4: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1119145 [15:09:02] https://www.wikidata.org/wiki/Special:ListGroupRights also looks promising [15:09:14] (except that event organizers is a redlink, but someone™ on wikidata will need to fix that ^^) [15:09:17] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.4: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1119145 (owner: 10Marostegui) [15:09:23] anything else you want to test? [15:09:45] (03Merged) 10jenkins-bot: control-mariadb-10.4: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1119145 (owner: 10Marostegui) [15:09:47] yep, I can confirm that the patch fixes the issue on testwiki [15:09:53] oh, apparently https://www.wikidata.org/wiki/Wikidata:Event_Organizers exists, but the link goes to the version with lowercase o [15:09:54] ok! [15:09:59] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Continuing with sync [15:11:39] Daimona: I think the Growth team are the people who most care about that, I guess? But that's definitely not owned official by Growth. [15:13:42] Re redlink: yeah, that page predates the group being enabled by default, so some things might need adjusting... Like overriding https://www.wikidata.org/wiki/MediaWiki:Grouppage-event-organizer [15:14:31] ah, thanks for that link [15:14:35] Re config: yeah, definitely. I seem to recall the MW platform team being involved in MW config, roughly 3 name changes ago, but I don't know if they've ever been official maintainers. [15:14:35] I’d only found https://www.wikidata.org/wiki/MediaWiki:Group-event-organizer [15:14:41] anyway I’ll leave a note on the AN [15:14:49] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:15:11] Thank you! 
[15:15:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1232', diff saved to https://phabricator.wikimedia.org/P73445 and previous config saved to /var/cache/conftool/dbconfig/20250212-151533-marostegui.json [15:15:36] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:15:48] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1232.eqiad.wmnet [15:15:49] commented at https://www.wikidata.org/wiki/Wikidata:Administrators%27_noticeboard#Event_organizers/Organizers [15:16:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: maintenance [15:16:58] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119136|Let sysops add/remove the event-organizer group by default (T376822)]] (duration: 12m 53s) [15:17:00] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [15:17:16] (03CR) 10Stevemunene: [C:03+2] Change dse-k8s-worker1002 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119102 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [15:17:36] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:18:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [15:18:57] !log UTC backport+config window done [15:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:08] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1119157 (https://phabricator.wikimedia.org/T386213) [15:21:13] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1119158 (https://phabricator.wikimedia.org/T386213) [15:22:14] !log root@cumin1002 END (PASS) - Cookbook 
sre.mysql.upgrade (exit_code=0) for db1232.eqiad.wmnet [15:23:00] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Index rebuild [15:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10544954 (10phaultfinder) [15:25:51] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:27:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2176 T385561', diff saved to https://phabricator.wikimedia.org/P73446 and previous config saved to /var/cache/conftool/dbconfig/20250212-152738-marostegui.json [15:27:42] T385561: Upgrade and rebuild s1 - https://phabricator.wikimedia.org/T385561 [15:29:00] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2176.codfw.wmnet [15:30:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: maintenance [15:30:57] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:31:44] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [15:32:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus 
[15:32:22] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [15:32:55] (03PS1) 10Cory Massaro: Revert "wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119163 [15:34:03] (03CR) 10Cory Massaro: [C:03+2] Revert "wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119163 (owner: 10Cory Massaro) [15:34:27] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f4bfb1e2280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [15:34:27] ia.org/wiki/Search%23Administration [15:34:27] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1a32b6b280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [15:34:27] ia.org/wiki/Search%23Administration [15:34:29] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 8 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 8, active_shards: 8, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 8, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, nu 
[15:34:29] in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:34:29] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 255 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 259, active_shards: 259, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 255, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number [15:34:29] light_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.38910505836576 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:35:19] (03Merged) 10jenkins-bot: Revert "wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-11-155417." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119163 (owner: 10Cory Massaro) [15:35:29] (03CR) 10Pppery: "Congrats!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: 10Dragoniez) [15:35:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:36:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2176.codfw.wmnet [15:37:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Index rebuild [15:38:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1004:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:39:19] (03CR) 10Dragoniez: "Lucas, Pppery, thanks for your help 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118497 (https://phabricator.wikimedia.org/T385960) (owner: 10Dragoniez) [15:40:25] FIRING: [3x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1004:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:50:25] FIRING: [4x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:58] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002 [16:09:10] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [alerts] - 10https://gerrit.wikimedia.org/r/1118547 (owner: 10Arturo Borrero Gonzalez) [16:09:22] !log Deleting benthos, changeprop, changeprop-jobqueue from staging to free pod ip blocks - T386107 [16:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:25] T386107: k8s staging seems to be out of IP addresses - https://phabricator.wikimedia.org/T386107 [16:13:57] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10545248 (10BCornwall) @RobH Looks like the offset change has made a good difference {F58391798} How would you feel about applying this to esams as well and then codifyin... 
[16:14:44] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/mw-api-int: apply [16:14:59] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-api-int: apply [16:15:07] (03CR) 10FNegri: [C:03+2] wmcs: kernel_errors: don't alert on warning messages [alerts] - 10https://gerrit.wikimedia.org/r/1118547 (owner: 10Arturo Borrero Gonzalez) [16:15:12] !log Halving mw-api-int staging replicas to free pod ip blocks - T386107 [16:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:15] T386107: k8s staging seems to be out of IP addresses - https://phabricator.wikimedia.org/T386107 [16:15:25] FIRING: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:37] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10545263 (10RobH) >>! In T373993#10545248, @BCornwall wrote: > @RobH Looks like the offset change has made a good difference > > {F58391798} > > How would you feel about... [16:22:23] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10545283 (10RobH) We will need to also have the provision cookbook updated for a new thermal profile setting flag to set these automatically via that cookbook. > The s... 
[16:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10545307 (10phaultfinder) [16:26:24] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:26:27] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:26:59] (03PS3) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-01-30-011236 to 2025-02-11-155338. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119109 (https://phabricator.wikimedia.org/T383631) (owner: 10Cory Massaro) [16:27:00] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-01-30-011236 to 2025-02-11-155338. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119109 (https://phabricator.wikimedia.org/T383631) (owner: 10Cory Massaro) [16:28:22] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-01-30-011236 to 2025-02-11-155338. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119109 (https://phabricator.wikimedia.org/T383631) (owner: 10Cory Massaro) [16:29:04] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:29:43] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:30:20] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:31:07] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:31:15] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:32:02] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:34:41] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [16:43:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [16:53:27] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: 10Gmodena) [16:59:15] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [16:59:41] (03PS1) 10Brouberol: airflow-analytics: fix typo in config [puppet] - 10https://gerrit.wikimedia.org/r/1119185 (https://phabricator.wikimedia.org/T386092) [17:00:41] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4937/co" [puppet] - 10https://gerrit.wikimedia.org/r/1119185 (https://phabricator.wikimedia.org/T386092) (owner: 10Brouberol) [17:00:48] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:02:51] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [17:08:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2230.codfw.wmnet with reason: maintenance [17:10:16] !log Install 10.6.21 on db2230 T385678 [17:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:19] T385678: Compile and package MariaDB 10.11.11 and MariaDB 10.6.21 - https://phabricator.wikimedia.org/T385678 [17:12:40] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [17:13:42] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [17:14:07] !log tchin@deploy2002 helmfile [codfw] START 
helmfile.d/services/eventstreams-internal: apply [17:14:56] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [17:20:35] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm [17:27:48] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-12-171406 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119191 (https://phabricator.wikimedia.org/T383631) [17:32:03] <_Gerges> Hi [17:33:19] <_Gerges> Can I upload a patch to a user talk page namespace for this task T371470 instead of creating a new task ? [17:33:20] T371470: Set noindex for user pages on Arabic Wikipedia - https://phabricator.wikimedia.org/T371470 [17:36:42] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash202[6-9] - https://phabricator.wikimedia.org/T383288#10545695 (10Jhancock.wm) [17:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:54] (03PS3) 10Federico Ceratto: clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 [17:58:59] (03PS1) 10GergesShamon: [arwiki] Set noindex for namespace user talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119194 (https://phabricator.wikimedia.org/T371470) [18:00:04] (03PS1) 10Bking: dumpsdata: remove decom'd servers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1119195 (https://phabricator.wikimedia.org/T353787) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T1800) [18:00:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 12 
UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119194 (https://phabricator.wikimedia.org/T371470) (owner: 10GergesShamon) [18:02:26] (03CR) 10Stevemunene: [C:03+1] dumpsdata: remove decom'd servers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1119195 (https://phabricator.wikimedia.org/T353787) (owner: 10Bking) [18:02:27] (03PS2) 10Arlolra: Bust cache for recreated pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) [18:03:27] (03Abandoned) 10Ottomata: MediaWikiPingback is now on event platform. Use eventlogging_legacy refine job [puppet] - 10https://gerrit.wikimedia.org/r/1050008 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [18:11:38] (03PS4) 10Federico Ceratto: clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 [18:12:29] (03PS1) 10DLynch: MobileFrontend: remove override for default mobile editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119197 (https://phabricator.wikimedia.org/T361134) [18:14:29] (03CR) 10Bking: [C:03+2] dumpsdata: remove decom'd servers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1119195 (https://phabricator.wikimedia.org/T353787) (owner: 10Bking) [18:26:15] (03PS1) 10BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) [18:26:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73447 and previous config saved to /var/cache/conftool/dbconfig/20250212-182637-root.json [18:31:15] (03CR) 10Andrew Bogott: "I have just scheduled this to be merged on Monday the 17th." 
[puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [18:31:17] (03PS2) 10BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) [18:35:35] !log bking@cumin2002 START - Cookbook sre.hosts.dhcp for host relforge1004.eqiad.wmnet [18:41:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73448 and previous config saved to /var/cache/conftool/dbconfig/20250212-184143-root.json [18:43:42] (03CR) 10BryanDavis: "Adding rzl as reviewer for a check on my helm changes plus advice on what else needs to be done to make the controller added in T348284 ac" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [18:44:21] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:44:55] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:45:19] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:45:56] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:46:59] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:47:29] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:47:43] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:47:57] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:50:49] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:50:54] !log tchin@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/eventstreams-internal: apply [18:51:03] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:51:40] PROBLEM - Host relforge1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:51:52] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:52:26] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [18:53:07] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [18:53:14] RECOVERY - Host relforge1004 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:55:25] RESOLVED: [5x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:48] RESOLVED: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:56:02] (03PS1) 10Ahmon Dancy: logspam.pl: Add emacs mode line [puppet] - 10https://gerrit.wikimedia.org/r/1119201 [18:56:09] PROBLEM - Elasticsearch HTTPS for relforge-eqiad on relforge1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [18:56:10] (03PS1) 10Ahmon Dancy: logspam.pl: [puppet] - 10https://gerrit.wikimedia.org/r/1119202 (https://phabricator.wikimedia.org/T347064) [18:56:10] PROBLEM - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [18:56:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73449 and previous config saved to /var/cache/conftool/dbconfig/20250212-185649-root.json [18:57:18] (03PS2) 10Ahmon Dancy: logspam.pl: [puppet] - 10https://gerrit.wikimedia.org/r/1119202 (https://phabricator.wikimedia.org/T347064) [19:02:13] (03CR) 10Ahmon Dancy: "Andre, you can test on mwlog1002.eqiad.wmnet like so:" [puppet] - 10https://gerrit.wikimedia.org/r/1119202 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy) [19:02:30] (03CR) 10Elukey: "Helloooo! I am a bit lost, didn't we already do it in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117550/1/modules/turnilo/templ" [puppet] - 10https://gerrit.wikimedia.org/r/1118477 (owner: 10Joal) [19:07:19] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:37] (03PS3) 10Ahmon Dancy: logspam.pl: Consolidate the "Failed to load data blob" exception [puppet] - 10https://gerrit.wikimedia.org/r/1119202 (https://phabricator.wikimedia.org/T347064) [19:11:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73450 and previous config saved to /var/cache/conftool/dbconfig/20250212-191155-root.json [19:13:27] (03Abandoned) 10Phuedx: tests: Assert event stream configs have valid samples [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115839 (owner: 10Phuedx) [19:14:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73451 and previous config saved to /var/cache/conftool/dbconfig/20250212-191404-root.json [19:23:49] (03PS1) 10Urbanecm: [Growth] enwiki: Enable mentorship for 100% of new accounts 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119204 (https://phabricator.wikimedia.org/T384505) [19:26:09] (03CR) 10RLazarus: [C:03+1] "In order to enable the controller in the namespace, you'll need a patch like this: https://gerrit.wikimedia.org/r/c/operations/deployment-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [19:27:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73452 and previous config saved to /var/cache/conftool/dbconfig/20250212-192700-root.json [19:29:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73453 and previous config saved to /var/cache/conftool/dbconfig/20250212-192909-root.json [19:31:00] (03PS1) 10Ottomata: eventgate-analytics remove canary release from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119206 (https://phabricator.wikimedia.org/T383814) [19:32:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:44:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73454 and previous config saved to /var/cache/conftool/dbconfig/20250212-194414-root.json [19:44:34] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:46:34] RECOVERY - BGP status on 
pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:50:24] (03PS1) 10Zabe: Reduce revision-slots cache expiry to 60s on diqwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119207 [19:59:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73455 and previous config saved to /var/cache/conftool/dbconfig/20250212-195919-root.json [19:59:40] PROBLEM - Elasticsearch HTTPS for relforge-eqiad on relforge1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [19:59:46] PROBLEM - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [20:00:04] ^^ expected, will silence [20:01:22] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on relforge1004.eqiad.wmnet with reason: T380752 [20:01:26] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [20:14:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73456 and previous config saved to /var/cache/conftool/dbconfig/20250212-201424-root.json [20:15:34] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:17:34] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:22:59] (03CR) 10BryanDavis: "Your quick attention is appreciated. I expected not to hear back for a while because of the offsite." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [20:30:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:30:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:30:50] (03CR) 10Clément Goubert: [C:03+1] eventgate-analytics remove canary release from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119206 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [20:51:22] <_Gerges> Ping [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T2100). [21:00:05] _Gerges: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:39] _Gerges: do you need a deployer? [21:00:55] <_Gerges> Here [21:01:29] <_Gerges> @cjming: yes [21:01:44] ok! 
[21:02:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119194 (https://phabricator.wikimedia.org/T371470) (owner: 10GergesShamon) [21:03:04] (03Merged) 10jenkins-bot: [arwiki] Set noindex for namespace user talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119194 (https://phabricator.wikimedia.org/T371470) (owner: 10GergesShamon) [21:03:35] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1119194|[arwiki] Set noindex for namespace user talk (T371470)]] [21:03:40] T371470: Set noindex for user pages on Arabic Wikipedia - https://phabricator.wikimedia.org/T371470 [21:06:38] !log cjming@deploy2002 cjming, gergesshamon: Backport for [[gerrit:1119194|[arwiki] Set noindex for namespace user talk (T371470)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:54] _Gerges: ok to sync? [21:07:13] <_Gerges> Ok [21:07:38] i think that means sync - syncing! [21:07:57] !log cjming@deploy2002 cjming, gergesshamon: Continuing with sync [21:14:41] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119194|[arwiki] Set noindex for namespace user talk (T371470)]] (duration: 11m 05s) [21:14:44] T371470: Set noindex for user pages on Arabic Wikipedia - https://phabricator.wikimedia.org/T371470 [21:14:53] _Gerges: should be live :) [21:15:11] do i need to run the namespace dupes script? [21:16:12] <_Gerges> Yes [21:16:53] Why would you need to run namespace dupes? [21:17:01] You're not adding a new NS, or changing aliases etc [21:18:31] <_Gerges> Sorry, bad internet I said "yes" to "should be live" [21:18:46] so no need to run script then - gtk ! 
[21:19:13] <_Gerges> Thank you [21:29:03] (03CR) 10RLazarus: [C:03+1] toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [21:31:34] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1124-1128].eqiad.wmnet [21:31:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1124-1128].eqiad.wmnet [21:35:04] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read Views for 34 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) [21:35:06] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read Views for mobile wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272) [21:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T2200) [22:02:04] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-12-171406 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119191 (https://phabricator.wikimedia.org/T383631) (owner: 10Jforrester) [22:02:53] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:03:13] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-02-03-215824 to 2025-02-12-171406 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119191 (https://phabricator.wikimedia.org/T383631) (owner: 10Jforrester) [22:04:14] !log apine@deploy2002 helmfile 
[staging] START helmfile.d/services/wikifunctions: apply [22:04:47] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:05:48] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:06:36] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:06:41] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:07:27] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:12:53] (03PS2) 10C. Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) [22:13:01] (03CR) 10C. Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [22:13:45] (03PS3) 10C. Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) [22:14:05] (03CR) 10C. Scott Ananian: "Still blocked until Icaf238844f092bc061b6383c8bfc863f3f2fd87d is tagged and rides the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. 
Scott Ananian) [22:23:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host relforge1004.eqiad.wmnet [22:44:19] (03PS1) 10Bking: cirrus: disable opensearch-madvise while we debate its future [puppet] - 10https://gerrit.wikimedia.org/r/1119227 (https://phabricator.wikimedia.org/T386281) [22:45:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119227 (https://phabricator.wikimedia.org/T386281) (owner: 10Bking) [22:47:22] (03CR) 10Ryan Kemper: [C:03+1] cirrus: disable opensearch-madvise while we debate its future [puppet] - 10https://gerrit.wikimedia.org/r/1119227 (https://phabricator.wikimedia.org/T386281) (owner: 10Bking) [22:47:51] (03CR) 10Bking: [C:03+2] cirrus: disable opensearch-madvise while we debate its future [puppet] - 10https://gerrit.wikimedia.org/r/1119227 (https://phabricator.wikimedia.org/T386281) (owner: 10Bking) [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250212T2300) [23:05:11] Hi all! 
We will be using the web deploy window for today [23:07:19] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:08:59] (03PS3) 10BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) [23:09:00] (03PS1) 10BryanDavis: admin_ng: Swtich on enableJobSidecarController for toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119231 (https://phabricator.wikimedia.org/T292861) [23:14:08] (03PS1) 10Stoyofuku-wmf: Lazy Load Images [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1119233 (https://phabricator.wikimedia.org/T366402) [23:15:30] (03PS1) 10Stoyofuku-wmf: Lazy Load Images [extensions/MobileFrontend] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119234 (https://phabricator.wikimedia.org/T366402) [23:16:34] Doing deploys now! 
Hopefully that is alright with you all [23:18:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1119233 (https://phabricator.wikimedia.org/T366402) (owner: 10Stoyofuku-wmf) [23:18:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119234 (https://phabricator.wikimedia.org/T366402) (owner: 10Stoyofuku-wmf) [23:27:18] (03Merged) 10jenkins-bot: Lazy Load Images [extensions/MobileFrontend] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1119233 (https://phabricator.wikimedia.org/T366402) (owner: 10Stoyofuku-wmf) [23:28:20] (03Merged) 10jenkins-bot: Lazy Load Images [extensions/MobileFrontend] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119234 (https://phabricator.wikimedia.org/T366402) (owner: 10Stoyofuku-wmf) [23:28:52] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1119233|Lazy Load Images (T366402)]], [[gerrit:1119234|Lazy Load Images (T366402)]] [23:31:32] (03PS2) 10Zabe: Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119207 (https://phabricator.wikimedia.org/T183490) [23:32:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:34:25] !log toyofuku@deploy2002 toyofuku: Backport for [[gerrit:1119233|Lazy Load Images (T366402)]], [[gerrit:1119234|Lazy Load Images (T366402)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:35:01] Loren 
will be testing for us! Brb [23:51:39] !log toyofuku@deploy2002 toyofuku: Continuing with sync [23:56:58] (03CR) 10Brennen Bearnes: [C:03+1] "Legit." [puppet] - 10https://gerrit.wikimedia.org/r/1119201 (owner: 10Ahmon Dancy)