[00:07:26] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11063102 (10Jhancock.wm) @Papaul this one did the thing about going to the wrong puppet server again. Can you delete it so i can try again later? [8/10, retrying in 640.00s] Attem... [00:07:34] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11063103 (10Jhancock.wm) [00:08:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 [00:08:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 (owner: 10TrainBranchBot) [00:08:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11063104 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm thanks @elukey! [00:14:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P80872 and previous config saved to /var/cache/conftool/dbconfig/20250806-001413-fceratto.json [00:29:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T399728)', diff saved to https://phabricator.wikimedia.org/P80873 and previous config saved to /var/cache/conftool/dbconfig/20250806-002921-fceratto.json [00:29:25] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [00:29:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [00:45:30] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 (owner: 10TrainBranchBot) [00:47:49] (03CR) 10Umherirrender: "Known failure: T400950" [core] 
(wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 (owner: 10TrainBranchBot) [00:49:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11063161 (10Papaul) @Jhancock.wm done [01:24:08] (03PS1) 10Andrew Bogott: profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972 [01:24:33] (03CR) 10CI reject: [V:04-1] profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972 (owner: 10Andrew Bogott) [01:25:09] (03PS2) 10Andrew Bogott: profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972 [01:28:02] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972 (owner: 10Andrew Bogott) [02:11:12] (03CR) 10RLazarus: "Some high-level questions about this, after reading through UpdateConfigs.php and T398422. If you've already talked through all this with " [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [02:24:46] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11063186 (10Jhancock.wm) 05Open→03Resolved @MatthewVernon we're finished with this test server if you want to run some test on it. It's a 1 CPU version of the config-J ser... [02:30:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11063190 (10Jhancock.wm) 05Open→03Resolved @BTullis this is a 1CPU version of the config I servers you use for the an-worker and an-presto servers. It's in codfw so I'm not sure... 
[03:02:46] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:05:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[03:09:14] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[03:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:12:46] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:14] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[03:22:35] jhancock@cumin1003 provision (PID 782454) is awaiting input
[03:43:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[03:43:45] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11063213 (Jhancock.wm) Open→Resolved @Marostegui got you a clean raid10. fyi, it is provisioned as uefi. thanks for your patience!
[05:09:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T0600) [06:12:13] (03PS1) 10Giuseppe Lavagetto: haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) [06:12:14] (03PS1) 10Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) [06:12:16] (03PS1) 10Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1175991 [06:12:16] (03PS1) 10Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) [06:12:39] (03CR) 10CI reject: [V:04-1] haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [06:14:44] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:20:56] (03PS2) 10Giuseppe Lavagetto: haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) [06:20:56] (03PS2) 10Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 
(https://phabricator.wikimedia.org/T396621) [06:20:56] (03PS2) 10Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1175991 [06:20:56] (03PS2) 10Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) [06:24:11] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6502/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [06:27:09] (03PS1) 10KartikMistry: Enable the Contribute menu in 9th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) [06:28:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) (owner: 10KartikMistry) [06:35:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:35:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T399728)', diff saved to https://phabricator.wikimedia.org/P80874 and previous config saved to /var/cache/conftool/dbconfig/20250806-063521-fceratto.json [06:35:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [06:39:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T399728)', diff saved to https://phabricator.wikimedia.org/P80875 and previous config saved to /var/cache/conftool/dbconfig/20250806-063903-fceratto.json [06:54:11] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P80876 and previous config saved to /var/cache/conftool/dbconfig/20250806-065410-fceratto.json [07:00:04] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:19] here [07:00:28] I'll deploy myself. [07:02:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) (owner: 10KartikMistry) [07:03:17] (03Merged) 10jenkins-bot: Enable the Contribute menu in 9th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) (owner: 10KartikMistry) [07:03:52] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1176040|Enable the Contribute menu in 9th group of Wikipedias (T397122)]] [07:03:56] T397122: Enable the Contribute menu in 9th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T397122 [07:05:54] !log kartik@deploy1003 kartik: Backport for [[gerrit:1176040|Enable the Contribute menu in 9th group of Wikipedias (T397122)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:08:16] !log kartik@deploy1003 kartik: Continuing with sync
[07:09:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P80877 and previous config saved to /var/cache/conftool/dbconfig/20250806-070918-fceratto.json
[07:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:13:23] (PS1) Slyngshede: data.yaml: re-add email [puppet] - https://gerrit.wikimedia.org/r/1176095
[07:13:29] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176040|Enable the Contribute menu in 9th group of Wikipedias (T397122)]] (duration: 09m 37s)
[07:13:32] T397122: Enable the Contribute menu in 9th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T397122
[07:14:40] I'm done. No more patches in the window AFAIK.
[07:23:19] (PS1) Majavah: P:toolforge::elasticsearch: Set new extra_config parameter [puppet] - https://gerrit.wikimedia.org/r/1176122 (https://phabricator.wikimedia.org/T401278)
[07:24:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T399728)', diff saved to https://phabricator.wikimedia.org/P80878 and previous config saved to /var/cache/conftool/dbconfig/20250806-072425-fceratto.json
[07:24:30] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:24:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[07:24:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T399728)', diff saved to https://phabricator.wikimedia.org/P80879 and previous config saved to /var/cache/conftool/dbconfig/20250806-072448-fceratto.json
[07:33:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T399728)', diff saved to https://phabricator.wikimedia.org/P80880 and previous config saved to /var/cache/conftool/dbconfig/20250806-073343-fceratto.json
[07:33:48] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:35:49] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.07.26 - 2025.08.15): Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855#11063541 (brouberol) ` ~ ❯ ssh cirrussearch2055.codfw.wmnet Stdio forwarding request failed: Session open refused by peer Connection closed by UN...
[07:43:31] (03PS1) 10Brouberol: datahub: increasae memory for frontend and mae-consumer pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599) [07:45:22] !log created wikilove tables on thwiki T401279 [07:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:26] T401279: Extension WikiLove for th.wikipedaia.org - https://phabricator.wikimedia.org/T401279 [07:46:52] (03PS2) 10Brouberol: datahub: increase memory for frontend and mae-consumer pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599) [07:48:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P80881 and previous config saved to /var/cache/conftool/dbconfig/20250806-074851-fceratto.json [07:54:00] (03PS5) 10Federico Ceratto: Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) [07:56:33] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1176122 (https://phabricator.wikimedia.org/T401278) (owner: 10Majavah) [07:56:39] (03CR) 10Majavah: [C:03+2] P:toolforge::elasticsearch: Set new extra_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/1176122 (https://phabricator.wikimedia.org/T401278) (owner: 10Majavah) [07:57:05] (03PS1) 10Chlod Alejandro: thwiki: enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176189 (https://phabricator.wikimedia.org/T401279) [07:59:16] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6503/console" [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [07:59:31] (03CR) 10Reedy: [C:03+2] thwiki: enable WikiLove [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1176189 (https://phabricator.wikimedia.org/T401279) (owner: 10Chlod Alejandro) [08:00:01] `%{message}` ahh [08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T0800) [08:00:05] things never changes :] [08:00:21] hashar: hackathon says hello [08:00:23] that is good old josnTruncated messages [08:00:25] (03Merged) 10jenkins-bot: thwiki: enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176189 (https://phabricator.wikimedia.org/T401279) (owner: 10Chlod Alejandro) [08:00:55] Reedy: hi hackathon! Please please waves your hands shouting "TRAIN IS ROLLING NOW!" [08:00:56] :) [08:01:05] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1176189|thwiki: enable WikiLove (T401279)]] [08:01:09] T401279: Extension WikiLove for th.wikipedia.org - https://phabricator.wikimedia.org/T401279 [08:01:13] I wanna check that jsontruncated message though [08:01:30] ","message":"AbuseFilter parser error: ID: regexfailure; position: 148; params: ... [08:01:37] * hashar files a task [08:03:02] !log reedy@deploy1003 reedy, chlod: Backport for [[gerrit:1176189|thwiki: enable WikiLove (T401279)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:03:49] !log reedy@deploy1003 reedy, chlod: Continuing with sync [08:03:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P80882 and previous config saved to /var/cache/conftool/dbconfig/20250806-080359-fceratto.json [08:08:14] (03CR) 10Vgutierrez: [V:03+1 C:03+1] haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [08:08:51] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176189|thwiki: enable WikiLove (T401279)]] (duration: 07m 46s) [08:08:54] T401279: Extension WikiLove for th.wikipedia.org - https://phabricator.wikimedia.org/T401279 [08:09:38] https://phabricator.wikimedia.org/T401285 [08:09:46] AbuseFilter parser error: ID: regexfailure; position: 148; params: /{{short description|American politician}}\\n{{Infobox officeholder \\n| name = Anthony Frontzak\\n|image = Pat Toomey, Official Portrait, 112th Congress.jpg\\n|.... [08:10:01] (03PS1) 10Brouberol: airflow: add kafka-main-{eqiad,codfw}-external to the common connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176190 (https://phabricator.wikimedia.org/T372912) [08:13:19] (03CR) 10MVernon: "The Phab task talks about 3 hosts per DC, which would be 6, but you have only 5 here. Is that intentional?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto) [08:13:23] (03CR) 10DCausse: [C:03+1] airflow: add kafka-main-{eqiad,codfw}-external to the common connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176190 (https://phabricator.wikimedia.org/T372912) (owner: 10Brouberol) [08:13:59] (03CR) 10Brouberol: [C:03+2] airflow: add kafka-main-{eqiad,codfw}-external to the common connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176190 (https://phabricator.wikimedia.org/T372912) (owner: 10Brouberol) [08:15:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [08:16:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [08:17:52] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176191 (https://phabricator.wikimedia.org/T396374) [08:17:54] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176191 (https://phabricator.wikimedia.org/T396374) (owner: 10TrainBranchBot) [08:18:43] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176191 (https://phabricator.wikimedia.org/T396374) (owner: 10TrainBranchBot) [08:19:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T399728)', diff saved to https://phabricator.wikimedia.org/P80883 and previous config saved to /var/cache/conftool/dbconfig/20250806-081906-fceratto.json [08:19:10] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:19:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [08:19:30] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T399728)', diff saved to https://phabricator.wikimedia.org/P80884 and previous config saved to /var/cache/conftool/dbconfig/20250806-081929-fceratto.json [08:20:04] (03PS1) 10Chlod Alejandro: thwiki: add WT namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) [08:23:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T399728)', diff saved to https://phabricator.wikimedia.org/P80885 and previous config saved to /var/cache/conftool/dbconfig/20250806-082311-fceratto.json [08:25:55] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.13 refs T396374 [08:25:59] T396374: 1.45.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T396374 [08:26:02] (03CR) 10Vgutierrez: [C:04-1] "varnish upload tests are happy: `0 tests failed, 0 tests skipped, 19 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [08:27:51] (03CR) 10Vgutierrez: Remove blocked-nets from varnish (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (owner: 10Giuseppe Lavagetto) [08:33:02] (03CR) 10Clément Goubert: [C:03+1] Alertmanager: add receiver and routing for experiment-platform tasks [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [08:33:19] (03CR) 10Clément Goubert: mw::maintenance: ExperimentationLab periodic job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [08:34:10] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1016 to dse-k8s-worker1019 [08:36:12] (03CR) 10Phuedx: [C:03+1] Alertmanager: add receiver and routing for experiment-platform tasks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175928 
(https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [08:38:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P80886 and previous config saved to /var/cache/conftool/dbconfig/20250806-083818-fceratto.json [08:39:18] (03PS5) 10Clément Goubert: mw::maintenance: ExperimentationLab periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [08:39:20] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [08:39:26] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from snapshot1016 to dse-k8s-worker1019 [08:40:02] (03CR) 10Federico Ceratto: "Replied to a question" [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto) [08:42:58] (03PS11) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [08:43:10] (03CR) 10Anzx: "minor changes to add task id to appropriate place" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: 10Chlod Alejandro) [08:43:42] (03CR) 10MVernon: [C:03+1] Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto) [08:44:26] (03CR) 10MVernon: "Hi @ltoscano@wikimedia.org is this a more helpful comment?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [08:45:06] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1016 to dse-k8s-worker1019 [08:48:15] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [08:49:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11063859 (10MatthewVernon) [08:52:10] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1016 to dse-k8s-worker1019 - btullis@cumin1003" [08:52:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1016 to dse-k8s-worker1019 - btullis@cumin1003" [08:52:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:52:30] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1019 on all recursors [08:52:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1019 on all recursors [08:52:33] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1019 [08:53:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P80887 and previous config saved to /var/cache/conftool/dbconfig/20250806-085326-fceratto.json [08:54:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1019 [08:55:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1016 to dse-k8s-worker1019 [08:56:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [08:59:55] (03CR) 
10Gkyziridis: [C:03+2] ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175888 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [09:01:32] (03Merged) 10jenkins-bot: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175888 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [09:07:24] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:07:42] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:08:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T399728)', diff saved to https://phabricator.wikimedia.org/P80888 and previous config saved to /var/cache/conftool/dbconfig/20250806-090833-fceratto.json [09:08:37] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:08:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1195.eqiad.wmnet with reason: Maintenance [09:08:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T399728)', diff saved to https://phabricator.wikimedia.org/P80889 and previous config saved to /var/cache/conftool/dbconfig/20250806-090856-fceratto.json [09:12:06] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage [09:12:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T399728)', diff saved to https://phabricator.wikimedia.org/P80890 and previous config saved to /var/cache/conftool/dbconfig/20250806-091235-fceratto.json [09:13:25] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block 
traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11063973 (10Joe) [09:15:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm [09:18:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [09:18:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage [09:19:17] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [09:20:33] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [09:22:14] 10SRE-SLO, 10EditCheck, 10Lift-Wing, 06Machine-Learning-Team, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11064019 (10gkyziridis) Hey @elukey thnx for sharing this issue. I have a question: Is this issue blocking the A/B testi... 
[09:22:22] (03CR) 10Jelto: [C:03+2] add more providers to fetch_external_clouds:vendors_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/1175781 (https://phabricator.wikimedia.org/T401003) (owner: 10Jelto) [09:26:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:27:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P80891 and previous config saved to /var/cache/conftool/dbconfig/20250806-092743-fceratto.json [09:29:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) (owner: 10Sergio Gimeno) [09:31:09] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage [09:33:03] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage [09:34:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage [09:34:16] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1018.eqiad.wmnet with reason: host reimage [09:35:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [09:36:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:38:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1018.eqiad.wmnet with reason: host reimage [09:38:33] (03CR) 10Jelto: [C:04-1] "I'd prefer to do that in requestctl. It's already quite complex to troubleshoot why certain request got blocked, so having that in one pla" [puppet] - 10https://gerrit.wikimedia.org/r/1175933 (owner: 10Dzahn) [09:38:38] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1019.eqiad.wmnet with reason: host reimage [09:41:25] FIRING: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:49] (03PS1) 10Hashar: ExperimentManager: Fix #getExperiment() when uninitialized [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1176195 (https://phabricator.wikimedia.org/T401294) [09:41:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage [09:42:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1176195 (https://phabricator.wikimedia.org/T401294) (owner: 10Hashar) [09:42:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P80892 and previous config saved to /var/cache/conftool/dbconfig/20250806-094250-fceratto.json [09:43:58] (03CR) 10Btullis: [C:03+2] Remove last references to snapshot servers [puppet] - 
10https://gerrit.wikimedia.org/r/1175903 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [09:45:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1019.eqiad.wmnet with reason: host reimage [09:45:30] (03Merged) 10jenkins-bot: ExperimentManager: Fix #getExperiment() when uninitialized [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1176195 (https://phabricator.wikimedia.org/T401294) (owner: 10Hashar) [09:45:55] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1176195|ExperimentManager: Fix #getExperiment() when uninitialized (T401294)]] [09:45:58] T401294: PHP Warning: Undefined array key "active_experiments" - https://phabricator.wikimedia.org/T401294 [09:46:25] RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:44] !log hashar@deploy1003 hashar: Backport for [[gerrit:1176195|ExperimentManager: Fix #getExperiment() when uninitialized (T401294)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:48:56] !log hashar@deploy1003 hashar: Continuing with sync [09:50:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm [09:54:15] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176195|ExperimentManager: Fix #getExperiment() when uninitialized (T401294)]] (duration: 08m 20s) [09:54:18] T401294: PHP Warning: Undefined array key "active_experiments" - https://phabricator.wikimedia.org/T401294 [09:54:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [09:57:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T399728)', diff saved to https://phabricator.wikimedia.org/P80893 and previous config saved to /var/cache/conftool/dbconfig/20250806-095758-fceratto.json [09:58:02] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:58:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance [09:58:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:58:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T399728)', diff saved to https://phabricator.wikimedia.org/P80894 and previous config saved to /var/cache/conftool/dbconfig/20250806-095839-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1000) [10:00:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [10:02:21] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1196 (T399728)', diff saved to https://phabricator.wikimedia.org/P80895 and previous config saved to /var/cache/conftool/dbconfig/20250806-100220-fceratto.json [10:03:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [10:17:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P80896 and previous config saved to /var/cache/conftool/dbconfig/20250806-101728-fceratto.json [10:23:49] (03CR) 10Urbanecm: "Lifting my -2, as CommunityConfigurationExample now has the latest two deployment branches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [10:25:15] 06SRE, 10SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300 (10OSleger-WMF) 03NEW [10:26:40] (03CR) 10Btullis: [C:03+2] Update flink-operator helm chart to match the upstream release v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [10:32:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P80897 and previous config saved to /var/cache/conftool/dbconfig/20250806-103235-fceratto.json [10:33:32] (03Merged) 10jenkins-bot: Update flink-operator helm chart to match the upstream release v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [10:39:43] (03PS1) 10Jaime Nuche: releases-jenkins: add dpkg options to jenkins package installation [puppet] - 10https://gerrit.wikimedia.org/r/1176198 (https://phabricator.wikimedia.org/T400645) [10:47:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1196 (T399728)', diff saved to https://phabricator.wikimedia.org/P80898 and previous config saved to /var/cache/conftool/dbconfig/20250806-104743-fceratto.json [10:47:47] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:47:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance [10:48:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T399728)', diff saved to https://phabricator.wikimedia.org/P80899 and previous config saved to /var/cache/conftool/dbconfig/20250806-104805-fceratto.json [10:48:58] (03PS1) 10Gkyziridis: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) [10:50:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T399728)', diff saved to https://phabricator.wikimedia.org/P80900 and previous config saved to /var/cache/conftool/dbconfig/20250806-105047-fceratto.json [10:50:50] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Thank you for the very swift help, LGTM <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [10:53:08] (03CR) 10Ozge: [C:03+1] ml-services: Deploy revertrisk-language-agnostic latest published image on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [10:54:14] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [10:55:54] (03Merged) 10jenkins-bot: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [10:58:30] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:58:45] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1100). nyaa~ [11:00:50] !log btullis@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:02:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P80901 and previous config saved to /var/cache/conftool/dbconfig/20250806-110555-fceratto.json [11:06:07] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11064490 (10WMDECyn) confirming this request from WMDE side [11:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:09:41] !log btullis@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:14:03] !log btullis@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:15:18] !log btullis@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:21:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P80902 and previous config saved to /var/cache/conftool/dbconfig/20250806-112102-fceratto.json [11:22:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:27:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:36:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T399728)', diff saved to https://phabricator.wikimedia.org/P80903 and previous config saved to /var/cache/conftool/dbconfig/20250806-113609-fceratto.json [11:36:14] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:36:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [11:36:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80904 and previous 
config saved to /var/cache/conftool/dbconfig/20250806-113633-fceratto.json [11:37:36] (03PS1) 10Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 [11:37:54] (03PS2) 10Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 [11:39:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:44:38] !log btullis@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: sync [11:44:40] !log btullis@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: sync [11:44:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:52:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80905 and previous config saved to /var/cache/conftool/dbconfig/20250806-115216-fceratto.json [11:52:20] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:53:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:03:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P80906 and previous config saved to /var/cache/conftool/dbconfig/20250806-120723-fceratto.json [12:11:48] !log btullis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:13:57] !log btullis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:16:10] (03CR) 10Elukey: "It is yes! I'd personally replace "$1 and $2 are the values captured in the two groups in parentheses in $jbod_re" with an example of befo" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:16:22] (03CR) 10Phuedx: [C:03+2] xLab: Deploy v0.8.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175961 (https://phabricator.wikimedia.org/T384107) (owner: 10Santiago Faci) [12:17:14] !log btullis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[12:17:51] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175961 (https://phabricator.wikimedia.org/T384107) (owner: 10Santiago Faci) [12:17:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:17:55] (03PS2) 10Reedy: thwiki: add WT namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: 10Chlod Alejandro) [12:18:07] (03CR) 10Reedy: [C:03+2] thwiki: add WT namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: 10Chlod Alejandro) [12:18:22] (03CR) 10Reedy: [C:03+2] "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: 10Chlod Alejandro) [12:18:36] !log btullis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:18:59] (03Merged) 10jenkins-bot: thwiki: add WT namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: 10Chlod Alejandro) [12:20:35] 10SRE-SLO, 10EditCheck, 10Lift-Wing, 06Machine-Learning-Team, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11064788 (10elukey) Hey @gkyziridis, nono this is something related to the SLO itself, we'll need to review the targets... 
[12:22:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P80907 and previous config saved to /var/cache/conftool/dbconfig/20250806-122231-fceratto.json [12:22:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:23:22] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1176192|thwiki: add WT namespace alias (T401287)]] [12:23:25] T401287: "WT" namespace alias for th.wikipedia.org - https://phabricator.wikimedia.org/T401287 [12:25:14] !log reedy@deploy1003 chlod, reedy: Backport for [[gerrit:1176192|thwiki: add WT namespace alias (T401287)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:25:55] 10ops-codfw, 06DC-Ops: Add scs-e3-codfw to monitoring - https://phabricator.wikimedia.org/T401310 (10ayounsi) 03NEW [12:26:44] !log reedy@deploy1003 chlod, reedy: Continuing with sync [12:26:59] (03PS1) 10Ayounsi: Rancid: add SR-Linux support [puppet] - 10https://gerrit.wikimedia.org/r/1176216 [12:27:01] (03PS12) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:27:25] (03CR) 10CI reject: [V:04-1] Rancid: add SR-Linux support [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [12:27:49] 10ops-eqiad, 06DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11064837 (10BTullis) [12:28:09] (03CR) 10MVernon: "How about this? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:29:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [12:31:39] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [12:32:18] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176192|thwiki: add WT namespace alias (T401287)]] (duration: 08m 56s) [12:32:21] T401287: "WT" namespace alias for th.wikipedia.org - https://phabricator.wikimedia.org/T401287 [12:32:33] (03PS2) 10Ayounsi: Rancid: add SR-Linux support [puppet] - 10https://gerrit.wikimedia.org/r/1176216 [12:33:12] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [12:33:24] !log run namespaceDupes.php on thwiki T401287 [12:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:03] (03PS1) 10Brouberol: Provision dse-k8s-worker1015 [puppet] - 10https://gerrit.wikimedia.org/r/1176218 (https://phabricator.wikimedia.org/T398438) [12:34:05] (03PS1) 10Brouberol: Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) [12:34:07] (03PS1) 10Brouberol: Provision dse-k8s-worker1017 [puppet] - 10https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438) [12:34:09] (03PS1) 10Brouberol: Provision dse-k8s-worker1018 [puppet] - 10https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438) [12:34:11] (03PS1) 10Brouberol: Provision dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438) [12:35:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[12:35:17] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2245 to codfw - jhancock@cumin1003" [12:35:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2245 to codfw - jhancock@cumin1003" [12:35:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:35:32] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2245 [12:35:33] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2246 [12:35:34] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2247 [12:35:35] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2248 [12:35:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2245 [12:35:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2246 [12:35:45] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2247 [12:35:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2248 [12:36:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:36:41] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2246.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:11] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2247.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:39] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80908 and previous config saved to /var/cache/conftool/dbconfig/20250806-123738-fceratto.json [12:37:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:37:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [12:37:47] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T399728)', diff saved to https://phabricator.wikimedia.org/P80909 and previous config saved to /var/cache/conftool/dbconfig/20250806-123751-fceratto.json [12:39:59] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:36] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [12:41:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T399728)', diff saved to https://phabricator.wikimedia.org/P80910 and previous config saved to /var/cache/conftool/dbconfig/20250806-124140-fceratto.json [12:42:13] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [12:43:56] (03PS1) 10Chlod Alejandro: Add maintenance script to recapitalize 'Nuke' tags [extensions/Nuke] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1176225 (https://phabricator.wikimedia.org/T381598) [12:46:29] (03CR) 10Reedy: [C:03+2] Add maintenance script to recapitalize 'Nuke' tags [extensions/Nuke] (wmf/1.45.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/1176225 (https://phabricator.wikimedia.org/T381598) (owner: 10Chlod Alejandro) [12:49:41] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [12:49:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:49:55] jhancock@cumin1003 provision (PID 848282) is awaiting input [12:51:27] (03CR) 10Elukey: [C:03+1] swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:51:53] 06SRE, 10SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065008 (10SLopes-WMF) As Otto's manager, I approve this request. [12:53:17] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2246.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:53:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11065013 (10elukey) To keep archives happy, late_command.sh fails. Reporting what I wrote on IRC to the Traffic team: ` All right back testing late_command on... 
[12:54:30] (03CR) 10MVernon: [C:03+2] swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:54:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:55:19] (03Merged) 10jenkins-bot: Add maintenance script to recapitalize 'Nuke' tags [extensions/Nuke] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1176225 (https://phabricator.wikimedia.org/T381598) (owner: 10Chlod Alejandro) [12:55:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2247.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:56:40] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [12:56:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P80911 and previous config saved to /var/cache/conftool/dbconfig/20250806-125648-fceratto.json [12:56:52] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1176225|Add maintenance script to recapitalize 'Nuke' tags (T381598)]] [12:56:53] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1091.eqiad.wmnet with OS bullseye [12:56:55] T381598: Create and run a maintenance script to rename incorrectly capitalised Nuke-tagged log entries [2HRS] - https://phabricator.wikimedia.org/T381598 
[12:57:09] (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1015 [puppet] - 10https://gerrit.wikimedia.org/r/1176218 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol) [12:57:29] (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol) [12:57:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye [12:57:41] (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1017 [puppet] - 10https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol) [12:57:56] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye [12:58:00] (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1018 [puppet] - 10https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol) [12:58:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:58:23] (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol) [12:58:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11065045 (10Jhancock.wm) a:05Marostegui→03Jhancock.wm [12:58:44] !log reedy@deploy1003 chlod, reedy: Backport for [[gerrit:1176225|Add maintenance script to recapitalize 'Nuke' tags (T381598)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:59:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:59:48] !log reedy@deploy1003 chlod, reedy: Continuing with sync [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1300). Please do the needful. [13:00:05] sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:22] o/ [13:00:42] o/ [13:00:49] want to self-service your beta change? ^^ [13:00:55] Sure [13:01:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11065069 (10Jhancock.wm) db2245 didn't pass provision. will investigate. [13:02:42] Oh, it seems @Reedy has locked backporting, I'll come back in 10min [13:03:36] Hi, I am seeing `14:55:59 npm warn tar TAR_ENTRY_ERROR ENOSPC: no space left on device, write` errors in Jenkins jobs. See for example https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/32039/console [13:04:13] physikerwelt: #wikimedia-releng is probably the better channel for that IIUC [13:04:17] yes [13:04:22] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dbprov2007.codfw.wmnet with OS bookworm [13:04:25] also, the latest message by wmf-insecte in there suggests the disk space got freed up again [13:04:33] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm [13:04:40] was 98%, now back to 42% [13:04:58] Lucas_WMDE: sorry, thank you. 
[13:05:11] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176225|Add maintenance script to recapitalize 'Nuke' tags (T381598)]] (duration: 08m 18s) [13:05:14] T381598: Create and run a maintenance script to rename incorrectly capitalised Nuke-tagged log entries [2HRS] - https://phabricator.wikimedia.org/T381598 [13:05:50] !log committing new homer config to add dse-k8s-worker101[5-9] to the bgp groups [13:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:31] Reedy: can sergi0 deploy or do you need something else backported? (I assume running the maint script shouldn’t conflict with another deployment) [13:08:34] (CR) Brouberol: [C:+2] Provision dse-k8s-worker1015 [puppet] - https://gerrit.wikimedia.org/r/1176218 (https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:08:58] (PS1) Genoveva Galarza: wikifunctions: Upgrade orchestrator from 2025-07-29-155618 to 2025-08-01-154925 [deployment-charts] - https://gerrit.wikimedia.org/r/1176230 (https://phabricator.wikimedia.org/T351458) [13:08:59] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [13:09:17] scap says is unlocked now so I'm going ahead [13:10:17] (CR) TrainBranchBot: [C:+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) (owner: Sergio Gimeno) [13:11:11] (Merged) jenkins-bot: [Growth] beta: enable new leveling up notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) (owner: Sergio Gimeno) [13:11:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [13:11:44] !log ran `foreachwiki extensions/Nuke/maintenance/normalizeNukeTags.php` T381598 [13:11:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:47] T381598: Create and run a maintenance script to rename incorrectly capitalised Nuke-tagged log entries [2HRS] - https://phabricator.wikimedia.org/T381598 [13:11:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P80912 and previous config saved to /var/cache/conftool/dbconfig/20250806-131155-fceratto.json [13:12:23] done [13:14:19] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [13:14:23] !log UTC afternoon backport+config window done [13:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:54] (CR) Dr0ptp4kt: "Thanks @rlazarus@wikimedia.org! Best if @phuedx@wikimedia.org chimes in (he's tech lead on the Experimentation Lab ("xLab" for short - I m" [puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx) [13:18:28] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling both afterwards [13:18:31] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [13:19:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [13:20:37] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling both afterwards [13:21:33] (CR) Brouberol: [C:+2] Provision dse-k8s-worker1016 [puppet] - https://gerrit.wikimedia.org/r/1176219 
(https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:21:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2007.codfw.wmnet with reason: host reimage [13:21:40] (PS2) Brouberol: Provision dse-k8s-worker1016 [puppet] - https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) [13:22:34] (PS1) Genoveva Galarza: wikifunctions: Upgrade evaluators from 2025-07-30-130544 to 2025-08-05-075031 [deployment-charts] - https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794) [13:25:09] (PS3) Brouberol: Provision dse-k8s-worker1016 [puppet] - https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) [13:25:09] (PS2) Brouberol: Provision dse-k8s-worker1017 [puppet] - https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438) [13:25:09] (PS2) Brouberol: Provision dse-k8s-worker1018 [puppet] - https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438) [13:25:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2007.codfw.wmnet with reason: host reimage [13:25:10] (PS2) Brouberol: Provision dse-k8s-worker1019 [puppet] - https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438) [13:25:23] ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319 (fnegri) NEW [13:26:01] ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11065222 (fnegri) [13:27:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T399728)', diff saved to https://phabricator.wikimedia.org/P80913 and previous config saved to 
/var/cache/conftool/dbconfig/20250806-132703-fceratto.json [13:27:07] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:27:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance [13:27:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T399728)', diff saved to https://phabricator.wikimedia.org/P80914 and previous config saved to /var/cache/conftool/dbconfig/20250806-132725-fceratto.json [13:27:35] (CR) Brouberol: [C:+2] Provision dse-k8s-worker1016 [puppet] - https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:27:37] (CR) Brouberol: [C:+2] Provision dse-k8s-worker1017 [puppet] - https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:27:40] (CR) Brouberol: [C:+2] Provision dse-k8s-worker1018 [puppet] - https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:27:47] (CR) Brouberol: [C:+2] Provision dse-k8s-worker1019 [puppet] - https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:29:26] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1091.eqiad.wmnet with OS bullseye [13:29:47] SRE, SRE-swift-storage, Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065253 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1091.eqiad.wmnet with OS bullseye completed: - ms-be1... 
[13:31:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T399728)', diff saved to https://phabricator.wikimedia.org/P80915 and previous config saved to /var/cache/conftool/dbconfig/20250806-133115-fceratto.json [13:36:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2088.codfw.wmnet with OS bullseye [13:36:14] SRE, SRE-swift-storage: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065287 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye completed: - ms-be2088 (**PASS**) - Dow... [13:37:04] (PS1) Brouberol: site: assign dse_k8s::worker role to dse-k8s-worker1019 [puppet] - https://gerrit.wikimedia.org/r/1176243 (https://phabricator.wikimedia.org/T398438) [13:39:18] (CR) Brouberol: [C:+2] site: assign dse_k8s::worker role to dse-k8s-worker1019 [puppet] - https://gerrit.wikimedia.org/r/1176243 (https://phabricator.wikimedia.org/T398438) (owner: Brouberol) [13:40:46] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065306 (ayounsi) [13:46:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P80916 and previous config saved to /var/cache/conftool/dbconfig/20250806-134623-fceratto.json [13:47:09] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065311 (ayounsi) @ssastry hello, we also need your approval to add @OSleger-WMF to `parsoid-admin` (cf. https://gerrit.wikimedia.org/r/plugins/gitiles/operation... 
[13:47:11] (PS1) Jelto: gitlab: adjust nftables throttling thresholds [puppet] - https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) [13:47:20] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065313 (ayounsi) [13:48:01] (PS2) Jelto: gitlab: adjust nftables throttling thresholds [puppet] - https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) [13:49:12] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [13:49:16] (PS1) Cwhite: prometheus: make extra_config field optional [puppet] - https://gerrit.wikimedia.org/r/1176247 [13:50:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [13:50:17] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2007.codfw.wmnet with OS bookworm [13:50:28] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065317 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm completed: - dbprov2007 (**WARN*... 
[13:51:34] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6506/co" [puppet] - https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) (owner: Jelto) [13:53:55] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065323 (Jhancock.wm) [13:54:17] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065324 (Jhancock.wm) Open→Resolved @jcrespo this is complete [13:59:14] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065367 (ayounsi) [13:59:43] (PS3) Hashar: build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1175475 [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1400) [14:00:08] (PS1) Effie Mouzeli: profile::hcaptcha::proxy: config improvements [puppet] - https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) [14:01:08] (CR) Hashar: "I have made some code adjustment after `QUnit.test.each` learned to output nice labels when being fed an array ( https://github.com/qunitj" [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1175475 (owner: Hashar) [14:01:26] (PS2) Effie Mouzeli: profile::hcaptcha::proxy: config improvements [puppet] - https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) [14:01:30] (CR) Genoveva Galarza: [C:+2] wikifunctions: Upgrade orchestrator from 2025-07-29-155618 to 2025-08-01-154925 [deployment-charts] - https://gerrit.wikimedia.org/r/1176230 (https://phabricator.wikimedia.org/T351458) (owner: Genoveva 
Galarza) [14:01:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P80917 and previous config saved to /var/cache/conftool/dbconfig/20250806-140130-fceratto.json [14:01:32] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: Effie Mouzeli) [14:03:21] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-07-29-155618 to 2025-08-01-154925 [deployment-charts] - https://gerrit.wikimedia.org/r/1176230 (https://phabricator.wikimedia.org/T351458) (owner: Genoveva Galarza) [14:05:54] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:06:41] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:07:14] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:07:44] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:07:54] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:08:23] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:08:52] (CR) Genoveva Galarza: [C:+2] wikifunctions: Upgrade evaluators from 2025-07-30-130544 to 2025-08-05-075031 [deployment-charts] - https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794) (owner: Genoveva Galarza) [14:11:04] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-07-30-130544 to 2025-08-05-075031 [deployment-charts] - https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794) (owner: Genoveva Galarza) [14:12:25] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:13:09] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:13:27] !log gengh@deploy1003 
helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:14:06] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:14:14] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:14:22] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:14:44] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:14:54] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:14:57] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:15:50] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:16:25] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:16:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T399728)', diff saved to https://phabricator.wikimedia.org/P80918 and previous config saved to /var/cache/conftool/dbconfig/20250806-141638-fceratto.json [14:16:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:16:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance [14:17:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T399728)', diff saved to https://phabricator.wikimedia.org/P80919 and previous config saved to /var/cache/conftool/dbconfig/20250806-141701-fceratto.json [14:17:22] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:18:03] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling both 
afterwards [14:18:06] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [14:18:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:19:05] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling both afterwards [14:20:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T399728)', diff saved to https://phabricator.wikimedia.org/P80920 and previous config saved to /var/cache/conftool/dbconfig/20250806-142046-fceratto.json [14:23:27] (PS1) Zabe: Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1176250 (https://phabricator.wikimedia.org/T397348) [14:23:40] (PS1) Zabe: Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176251 (https://phabricator.wikimedia.org/T397348) [14:23:56] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2703 MB (3% inode=89%): /tmp 2703 MB (3% inode=89%): /var/tmp 2703 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [14:25:25] (CR) Effie Mouzeli: [C:+1] php8.1: rebuild to pick up 8.1.33-1+wmf11u2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) 
(owner: Scott French) [14:29:02] (PS1) MVernon: swift: remove old nodes, drain & reweight SM C-J nodes [puppet] - https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1430) [14:31:15] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) (owner: MVernon) [14:33:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:34:13] (CR) Zabe: multiversion: Move remaining dblist helper to WmfConfig class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1141522 (owner: Krinkle) [14:35:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P80921 and previous config saved to /var/cache/conftool/dbconfig/20250806-143554-fceratto.json [14:36:31] (CR) Eevans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) (owner: MVernon) [14:37:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:39:39] SRE, SRE-swift-storage, Patch-For-Review: Swift device facts / names for new JBOD controllers - 
https://phabricator.wikimedia.org/T401127#11065568 (MatthewVernon) Open→Resolved a:MatthewVernon Demonstrating how this works, you can see that the two systems with these controllers in hav... [14:40:19] (PS1) Sergio Gimeno: [Growth] Remove get-started notification variant delays [mediawiki-config] - https://gerrit.wikimedia.org/r/1176254 [14:40:59] (CR) MVernon: [C:+2] swift: remove old nodes, drain & reweight SM C-J nodes [puppet] - https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) (owner: MVernon) [14:41:41] (CR) Krinkle: multiversion: Move remaining dblist helper to WmfConfig class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1141522 (owner: Krinkle) [14:47:26] (CR) Zabe: multiversion: Move remaining dblist helper to WmfConfig class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1141522 (owner: Krinkle) [14:47:49] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11065593 (MatthewVernon) [14:47:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:48:02] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: Effie Mouzeli) [14:50:10] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling both afterwards [14:50:13] (CR) Tchanders: [C:-1] "Looks like we can solve this by just moving 
the assignment of the edit right to the 'temp' group (i.e. removing the block linked to above)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: STran) [14:50:13] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [14:50:39] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling both afterwards [14:50:50] (CR) Krinkle: multiversion: Move remaining dblist helper to WmfConfig class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1141522 (owner: Krinkle) [14:51:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P80922 and previous config saved to /var/cache/conftool/dbconfig/20250806-145101-fceratto.json [14:57:27] (PS1) MVernon: swift: remove ms-be106[1-3] [puppet] - https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) [15:06:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T399728)', diff saved to https://phabricator.wikimedia.org/P80923 and previous config saved to /var/cache/conftool/dbconfig/20250806-150609-fceratto.json [15:06:13] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:06:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [15:06:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T399728)', diff saved to https://phabricator.wikimedia.org/P80924 and previous config saved to /var/cache/conftool/dbconfig/20250806-150631-fceratto.json 
[15:07:28] (PS1) Brouberol: Update the image tag associated with PG 15 [puppet] - https://gerrit.wikimedia.org/r/1176261 (https://phabricator.wikimedia.org/T396037) [15:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:45] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T399728)', diff saved to https://phabricator.wikimedia.org/P80925 and previous config saved to /var/cache/conftool/dbconfig/20250806-151017-fceratto.json [15:17:58] SRE-OnFire, WMDE-TechWish-Maintenance, Sustainability (Incident Followup): Split out reusable Parsoid+Cite analysis module from scraper - https://phabricator.wikimedia.org/T401334 (awight) NEW [15:19:25] (PS1) Andrew Bogott: Magnum: require 'helm3' rather than 'helm' [puppet] - https://gerrit.wikimedia.org/r/1176262 [15:19:45] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:05] (CR) Andrew Bogott: [C:+2] Magnum: require 'helm3' rather than 'helm' [puppet] - https://gerrit.wikimedia.org/r/1176262 (owner: Andrew Bogott) [15:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:35] SRE-OnFire, Cite (Sub-referencing), Sustainability (Incident Followup): Spike: define operational monitoring requirements for Cite error alerting - https://phabricator.wikimedia.org/T401335 (awight) NEW [15:25:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P80926 and previous config saved to /var/cache/conftool/dbconfig/20250806-152524-fceratto.json [15:26:15] (CR) Clément Goubert: [C:-1] kube-state-metrics: collect metrics for metadata.labels.username (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1176209 (owner: Effie Mouzeli) [15:26:27] SRE-OnFire, Cite, VisualEditor, Patch-For-Review, and 4 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11065763 (awight) [15:26:50] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065765 (ssastry) I approve. [15:27:20] SRE-OnFire, Cite, Cite (Sub-referencing), Sustainability (Incident Followup), WMDE-TechWish-Sprint-Cherry-Chocolate-Ice-Cream-2025-07-23: Tech debt: review uses of references list item id during Parsoid html2wt - https://phabricator.wikimedia.org/T400803#11065766 (awight) [15:29:29] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065767 (ABreault-WMF) I think he also needs `parsoid-test-roots` https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules... 
[15:31:10] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065770 (ssastry) The broader request is to add Otto to all groups that the other members of content-transform-team are part of. Thanks! [15:37:02] (PS3) Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - https://gerrit.wikimedia.org/r/1176209 [15:37:12] (CR) Stevemunene: [C:+1] datahub: increase memory for frontend and mae-consumer pods [deployment-charts] - https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599) (owner: Brouberol) [15:37:26] (CR) Brouberol: [C:+2] datahub: increase memory for frontend and mae-consumer pods [deployment-charts] - https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599) (owner: Brouberol) [15:39:30] SRE, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11065825 (KFrancis) Hi all, confirming receipt of this request. Please confirm Halima Sadiya Mohammed is the user's full name and please p... [15:40:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P80927 and previous config saved to /var/cache/conftool/dbconfig/20250806-154032-fceratto.json [15:42:54] !log dancy@deploy1003 Installing scap version "4.197.0" for 169 host(s) [15:44:06] (CR) Brouberol: [C:+1] Add collation to the list of sqooped table [puppet] - https://gerrit.wikimedia.org/r/1175924 (https://phabricator.wikimedia.org/T397923) (owner: Aleksandar Mastilovic) [15:46:04] (CR) MVernon: "If you could eyeball this today-your-working-day, please, I can deploy tomorrow-my-working-day and get the hosts decommissioned. Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: MVernon) [15:46:41] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling both afterwards [15:46:49] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [15:48:05] !log dancy@deploy1003 Installation of scap version "4.197.0" completed for 169 hosts [15:48:45] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling both afterwards [15:49:57] (CR) Dzahn: [C:+1] gitlab: adjust nftables throttling thresholds [puppet] - https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) (owner: Jelto) [15:52:25] (CR) Hashar: [C:+1] gerrit: add daemons ssh host key to known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: Hashar) [15:55:20] (PS1) Santiago Faci: xLab: Deploy v0.8.1 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316) [15:55:37] (PS2) Santiago Faci: xLab: Deploy v0.8.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316) [15:55:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T399728)', diff saved to https://phabricator.wikimedia.org/P80928 and previous config saved to /var/cache/conftool/dbconfig/20250806-155540-fceratto.json [15:55:44] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 
'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:55:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [15:57:25] (03PS1) 10Santiago Faci: xLab: Deploy v0.8.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176267 (https://phabricator.wikimedia.org/T401316) [15:57:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:59:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1251.eqiad.wmnet with reason: Maintenance [15:59:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T399728)', diff saved to https://phabricator.wikimedia.org/P80929 and previous config saved to /var/cache/conftool/dbconfig/20250806-155939-fceratto.json [16:00:41] (03CR) 10Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 (owner: 10Effie Mouzeli) [16:00:44] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316) (owner: 10Santiago Faci) [16:00:57] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176267 (https://phabricator.wikimedia.org/T401316) (owner: 10Santiago Faci) [16:02:44] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316) (owner: 10Santiago Faci) [16:03:02] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176267 (https://phabricator.wikimedia.org/T401316) (owner: 10Santiago Faci) 
[16:03:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T399728)', diff saved to https://phabricator.wikimedia.org/P80930 and previous config saved to /var/cache/conftool/dbconfig/20250806-160323-fceratto.json [16:03:28] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:04:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:40] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:30] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:07:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54369 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:07:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:07:58] (03CR) 10Clément Goubert: [C:03+1] kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 (owner: 10Effie Mouzeli) [16:13:49] (03CR) 10Dzahn: [C:03+2] "thank you, makes sense, checked man page" [puppet] - 10https://gerrit.wikimedia.org/r/1176198 (https://phabricator.wikimedia.org/T400645) (owner: 10Jaime Nuche) [16:16:39] (03Abandoned) 10Dzahn: phabricator: block some scrapers and bots at apache level [puppet] - 10https://gerrit.wikimedia.org/r/1175933 (owner: 10Dzahn) [16:18:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P80931 and previous config saved to /var/cache/conftool/dbconfig/20250806-161831-fceratto.json [16:20:32] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 155062 MB (4% inode=99%): /var/lib/hadoop/data/h 155405 MB (4% inode=99%): /var/lib/hadoop/data/b 167701 MB (4% inode=99%): /var/lib/hadoop/data/k 145243 MB (3% inode=99%): /var/lib/hadoop/data/m 153546 MB (4% inode=99%): /var/lib/hadoop/data/f 158147 MB (4% inode=99%): /var/lib/hadoop/data/j 157955 MB (4% inode=99%): /var/lib/hadoop/data [16:20:32] 2 MB (4% inode=99%): /var/lib/hadoop/data/l 164977 MB (4% inode=99%): /var/lib/hadoop/data/i 151408 MB (4% inode=99%): /var/lib/hadoop/data/g 152199 MB (4% inode=99%): /var/lib/hadoop/data/c 155277 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [16:24:32] (03CR) 10Cwhite: [C:03+1] "LGTM - LMK if you need help rolling this out." [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [16:24:40] (03CR) 10Scott French: profile::hcaptcha::proxy: config improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli) [16:30:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:32:37] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:33:12] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:33:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P80934 and previous config saved to /var/cache/conftool/dbconfig/20250806-163338-fceratto.json [16:34:59] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [16:37:51] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [16:37:51] (03CR) 10Federico Ceratto: "I see the 3 hosts already drained in modules/swift/files/eqiad-prod_hosts.yaml and they match the regex ms-be106[1-3] as described" [puppet] - 10https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [16:38:25] (03CR) 10RLazarus: [C:03+2] Alertmanager: add receiver and routing for experiment-platform tasks [puppet] - 
10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [16:38:30] (03CR) 10Federico Ceratto: [C:03+1] "LGTM, see previous comment" [puppet] - 10https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [16:45:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T399728)', diff saved to https://phabricator.wikimedia.org/P80935 and previous config saved to /var/cache/conftool/dbconfig/20250806-164846-fceratto.json [16:48:51] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:49:02] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [16:49:31] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:52:51] What have we got here? 
[16:53:16] gerrit restart [16:53:44] gotcha, ty [16:54:07] trying to revive it [16:57:12] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:57:12] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:57:12] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:57:24] oh wow [16:57:34] ah gerrit :) [16:57:43] I think I just got it back [16:57:45] wfm now [16:57:53] thanks, forcing recheck [16:58:20] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:58:20] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:58:21] I should port this over to alert manager [16:58:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:59:08] Reedy: ok now, right? 
[16:59:31] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:00:04] swfrench-wmf: Your horoscope predicts another MediaWiki infrastructure (UTC late) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1700). [17:00:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:36] o/ [17:00:45] I'll be getting started here in a bit [17:01:11] o/ [17:01:46] thanks for sticking around, tgr :) [17:01:54] I'll keep you posted on when things are ready to test [17:02:16] (03CR) 10Clare Ming: "thanks @cwhite@wikimedia.org! i added this patch to the bonus puppet window that @rlazarus@wikimedia.org set up for us -- hope that works " [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [17:03:19] (03CR) 10RLazarus: [C:03+2] "No worries, it's all deployed! I figured this one didn't need any coordination so I just took care of it." [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming) [17:03:54] !log reprepro include php8.1_8.1.33-1+wmf11u2 in component/php81 - T383047 [17:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:58] T383047: Could not send confirmation email: Unknown error in PHP's mail() function. 
- https://phabricator.wikimedia.org/T383047 [17:05:33] (03CR) 10RLazarus: [C:03+1] kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 (owner: 10Effie Mouzeli) [17:06:18] (03CR) 10Scott French: [V:03+2] "Build locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: 10Scott French) [17:06:27] (03CR) 10Scott French: [V:03+2] "Thanks for the review, Effie!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: 10Scott French) [17:06:44] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: rebuild to pick up 8.1.33-1+wmf11u2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: 10Scott French) [17:10:40] !log built and published php8.1 production image stack at 8.1.33-1-s3 - T383047 [17:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] T383047: Could not send confirmation email: Unknown error in PHP's mail() function. 
- https://phabricator.wikimedia.org/T383047 [17:11:21] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up new 8.1.33-1-s3 production images - T383047 [17:12:22] tgr: since this requires a full image rebuild, it'll probably 15-20m until the new image is live in mw-debug [17:15:05] !log amastilovic@deploy1003 Started deploy [analytics/refinery@2178dda] (hadoop-test): Updates to sqoop TEST [analytics/refinery@2178dda8] [17:15:08] (03PS1) 10Dzahn: gerrit: add an IP to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1176288 [17:15:59] !log amastilovic@deploy1003 Finished deploy [analytics/refinery@2178dda] (hadoop-test): Updates to sqoop TEST [analytics/refinery@2178dda8] (duration: 00m 53s) [17:16:54] !log amastilovic@deploy1003 Started deploy [analytics/refinery@2178dda]: Updates to sqoop [analytics/refinery@2178dda8] [17:17:23] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:17:28] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [17:19:08] (03CR) 10Dzahn: [C:03+2] gerrit: add an IP to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1176288 (owner: 10Dzahn) [17:19:23] !log amastilovic@deploy1003 Finished deploy [analytics/refinery@2178dda]: Updates to sqoop [analytics/refinery@2178dda8] (duration: 02m 29s) [17:19:34] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1013.eqiad.wmnet w/ force delete existing files, repooling both afterwards [17:19:43] !log amastilovic@deploy1003 Started deploy [analytics/refinery@2178dda] (thin): Updates to sqoop THIN [analytics/refinery@2178dda8] [17:20:51] !log amastilovic@deploy1003 Finished deploy 
[analytics/refinery@2178dda] (thin): Updates to sqoop THIN [analytics/refinery@2178dda8] (duration: 01m 08s) [17:30:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:32:52] !log swfrench@deploy1003 swfrench: Deployment to pick up new 8.1.33-1-s3 production images - T383047 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:32:55] T383047: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047 [17:33:29] tgr: we're live in mw-debug if there's anything you'd like to check there [17:33:56] thanks, checking [17:34:33] just successfully Special:EmailUser'd myself, so at least I've not borked anything horribly, heh [17:38:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:17] mutante: is the gerrit probe failure expected? [17:42:37] bd808: no. but we just blocked some abuse.. [17:42:56] swfrench-wmf: I went through the common workflows involving email, they all work [17:43:09] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:43:12] tgr: amazing, thank you very much! [17:43:42] I'll continue, and we can see how things improve w.r.t. 
error handling over the next couple of hours [17:43:54] bd808: should recover in a sec but WIP [17:44:31] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:44:36] !log swfrench@deploy1003 swfrench: Continuing with sync [17:47:52] mutante: ack. thanks for chasing the ghosts that keep messing with us [17:48:09] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:52:17] jhancock@cumin1003 provision (PID 882499) is awaiting input [17:53:09] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:54:31] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:56:05] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up new 8.1.33-1-s3 production images - T383047 (duration: 45m 10s) [17:56:08] T383047: Could not send confirmation email: Unknown error in PHP's mail() function. 
- https://phabricator.wikimedia.org/T383047 [17:56:45] alright, I should be done with (what little remains of) the infra window [17:58:09] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:58:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1800) [18:03:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:06:56] (03PS1) 10CDanis: benthos: webrequest_sampled_live: remove client_port [puppet] - 10https://gerrit.wikimedia.org/r/1176295 (https://phabricator.wikimedia.org/T398236) [18:06:58] (03PS1) 10CDanis: turnilo: webrequest_sampled_live: remove client_port [puppet] - 10https://gerrit.wikimedia.org/r/1176296 (https://phabricator.wikimedia.org/T398236) [18:11:42] (03PS1) 10Dzahn: gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - 10https://gerrit.wikimedia.org/r/1176297 [18:11:57] (03CR) 10CI reject: [V:04-1] gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - 10https://gerrit.wikimedia.org/r/1176297 (owner: 10Dzahn) [18:12:10] (03PS2) 10Dzahn: gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - 10https://gerrit.wikimedia.org/r/1176297 [18:13:49] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, 
transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling both afterwards [18:13:52] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [18:15:52] (03CR) 10Dzahn: [C:03+2] gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - 10https://gerrit.wikimedia.org/r/1176297 (owner: 10Dzahn) [18:17:00] brennen: I see that the train rolled during the earlier window. would it be alright if I sneak in some infra-related changes (enabling PHP 8.3 image builds) during this window? [18:17:44] swfrench-wmf: yep, nothing train-related for this window. go for it. [18:17:50] amazing [18:17:54] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1013.eqiad.wmnet w/ force delete existing files, repooling both afterwards [18:18:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:03] FYI, I won't be taking any action until closer to 19:00 UTC [18:23:09] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:22] swfrench-wmf: currently fighting gerrit problems [18:23:41] ack, thanks mutante! 
[18:23:58] (that's also part of why I'm holding) [18:24:02] great [18:24:11] (03PS1) 10CDanis: benthos webrequest: Add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) [18:24:14] (03PS1) 10CDanis: turnilo: webrequest: add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176300 (https://phabricator.wikimedia.org/T400753) [18:24:31] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:29:01] (03CR) 10CI reject: [V:04-1] benthos webrequest: Add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis) [18:29:31] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:33:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:26] (03CR) 10CDanis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis) [18:38:00] swfrench-wmf: for now it seems better [18:38:36] (03PS1) 10CDanis: [WIP] haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 [18:42:44] (03PS1) 10Dzahn: Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 [18:42:59] 
(03CR) 10CI reject: [V:04-1] Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 (owner: 10Dzahn) [18:43:41] 06SRE, 10SRE-SLO, 06Traffic: Page on ATS backend errors relative to traffic - https://phabricator.wikimedia.org/T400675#11066418 (10RLazarus) We talked about this in the SLO meeting today -- one possible approach is to keep `ATSBackendErrorsHigh` as a default policy, but keep a list of services to //exclude/... [18:47:53] mutante: awesome, thank you! [18:51:42] (03PS2) 10Dzahn: Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 [18:52:36] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2245'] [18:52:50] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2245'] [18:53:17] (03PS3) 10Dzahn: Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 [18:54:44] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2245.codfw.wmnet with OS bookworm [18:54:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11066430 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2245.codfw.wmnet with OS bookworm [18:54:51] (03CR) 10Dzahn: [C:03+2] Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 (owner: 10Dzahn) [18:55:30] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2246.codfw.wmnet with OS bookworm [18:55:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11066431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2246.codfw.wmnet with OS bookworm [18:55:51] !log jhancock@cumin1003 
START - Cookbook sre.hosts.reimage for host db2247.codfw.wmnet with OS bookworm [18:56:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11066432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2247.codfw.wmnet with OS bookworm [19:02:38] alright, I'll be getting started on those infra-related changes shortly [19:08:47] jhancock@cumin1003 reimage (PID 890906) is awaiting input [19:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:10:23] jhancock@cumin1003 reimage (PID 890975) is awaiting input [19:11:32] !log swfrench@deploy1003 Started scap sync-world: No-op deployment to configure PHP 8.3 image builds - T399884 [19:11:35] T399884: Configure production MediaWiki image builds for PHP 8.3 - https://phabricator.wikimedia.org/T399884 [19:11:39] jhancock@cumin1003 reimage (PID 890989) is awaiting input [19:30:19] !log swfrench@deploy1003 Finished scap sync-world: No-op deployment to configure PHP 8.3 image builds - T399884 (duration: 19m 22s) [19:30:23] T399884: Configure production MediaWiki image builds for PHP 8.3 - https://phabricator.wikimedia.org/T399884 [19:31:27] alright, I'm done with my changes [19:40:32] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 160519 MB (4% inode=99%): /var/lib/hadoop/data/h 155197 MB (4% inode=99%): /var/lib/hadoop/data/b 159570 MB (4% inode=99%): /var/lib/hadoop/data/k 149609 MB (3% inode=99%): /var/lib/hadoop/data/m 156022 MB (4% inode=99%): /var/lib/hadoop/data/f 161558 MB (4% inode=99%): /var/lib/hadoop/data/j 158060 MB (4% inode=99%): /var/lib/hadoop/data [19:40:32] 5 MB (4% inode=99%): /var/lib/hadoop/data/l 154186 MB 
(4% inode=99%): /var/lib/hadoop/data/i 157536 MB (4% inode=99%): /var/lib/hadoop/data/g 156551 MB (4% inode=99%): /var/lib/hadoop/data/c 156477 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [19:41:08] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 12143 MB (4% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [19:43:14] PROBLEM - Host msw1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:43:38] RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [19:54:06] !log maintenance goin on on msw1-eqiad [19:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:29] ok doing it [19:55:51] ok doing it [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:05:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:06:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2245.codfw.wmnet with OS bookworm [20:08:59] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2246.codfw.wmnet with OS bookworm [20:09:57] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2247.codfw.wmnet with OS bookworm [20:10:25] (03CR) 10BPirkle: [C:03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [20:25:49] Is the ES servers for trace.wikimedia.org queryable manually at all? 
[20:38:32] !log dancy@deploy1003 Installing scap version "4.197.1" for 169 host(s) [20:42:43] !log dancy@deploy1003 Installing scap version "4.197.1" for 1 host(s) [20:43:35] !log dancy@deploy1003 Installation of scap version "4.197.1" completed for 1 hosts [20:53:51] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:53:55] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [20:59:23] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1014.eqiad.wmnet w/ force delete existing files, repooling both afterwards [20:59:27] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T2100) [21:00:32] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 151429 MB (4% inode=99%): /var/lib/hadoop/data/h 151208 MB (4% inode=99%): /var/lib/hadoop/data/b 152901 MB (4% inode=99%): /var/lib/hadoop/data/k 145183 MB (3% inode=99%): /var/lib/hadoop/data/m 146753 MB (3% inode=99%): /var/lib/hadoop/data/f 153605 MB (4% inode=99%): /var/lib/hadoop/data/j 150407 MB (4% inode=99%): /var/lib/hadoop/data [21:00:32] 3 MB (4% inode=99%): /var/lib/hadoop/data/l 149050 MB (3% inode=99%): /var/lib/hadoop/data/i 154517 MB (4% inode=99%): /var/lib/hadoop/data/g 145566 MB (3% inode=99%): /var/lib/hadoop/data/c 148083 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [21:05:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:06:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:26] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:14:39] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97) [21:15:14] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:20:32] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 127122 MB (3% inode=99%): /var/lib/hadoop/data/g 134715 MB (3% inode=99%): /var/lib/hadoop/data/j 134425 MB (3% inode=99%): /var/lib/hadoop/data/c 126480 MB (3% inode=99%): /var/lib/hadoop/data/b 133953 MB (3% inode=99%): /var/lib/hadoop/data/l 136852 MB (3% inode=99%): /var/lib/hadoop/data/k 121830 MB (3% inode=99%): /var/lib/hadoop/data [21:20:32] 4 MB (3% inode=99%): /var/lib/hadoop/data/i 135656 MB (3% inode=99%): /var/lib/hadoop/data/m 135499 MB (3% inode=99%): /var/lib/hadoop/data/d 130019 MB (3% inode=99%): /var/lib/hadoop/data/h 135665 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [21:34:24] (03CR) 10Eevans: [C:03+1] swift: remove ms-be106[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [21:50:55] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer 
(exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:51:00] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:53:10] (03PS28) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [21:57:54] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1014.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:57:58] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:59:27] (03CR) 10CDobbins: dnsrecursor: add recursor.yml.erb (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T2200) [22:09:55] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:46:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [22:49:37] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T401210#11066919 (10VRiley-WMF) 05Open→03Resolved Reseated cable and it seems to have come back online. 
[22:51:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [22:57:20] (03PS4) 10Cwhite: prometheus::elasticsearch_exporter: make extra_config field optional [puppet] - 10https://gerrit.wikimedia.org/r/1176247 (https://phabricator.wikimedia.org/T401278) [22:57:20] (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1176247/6508/" [puppet] - 10https://gerrit.wikimedia.org/r/1176247 (https://phabricator.wikimedia.org/T401278) (owner: 10Cwhite) [23:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:12:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [23:17:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [23:19:17] (03PS1) 10Dzahn: admin: add an alias to my own .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/1176322 [23:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176323 [23:38:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1176323 (owner: 10TrainBranchBot) [23:42:47] (03PS1) 10Dzahn: jenkins: escape : with \ in sudoers privileges line [puppet] - 10https://gerrit.wikimedia.org/r/1176324 (https://phabricator.wikimedia.org/T400645) [23:43:05] (03CR) 10Dzahn: [C:03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1176324" [puppet] - 10https://gerrit.wikimedia.org/r/1176198 (https://phabricator.wikimedia.org/T400645) (owner: 10Jaime Nuche) [23:43:42] (03CR) 10Dzahn: [C:03+2] jenkins: escape : with \ in sudoers privileges line [puppet] - 10https://gerrit.wikimedia.org/r/1176324 (https://phabricator.wikimedia.org/T400645) (owner: 10Dzahn) [23:55:09] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176323 (owner: 10TrainBranchBot)