[00:07:26] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11063102 (10Jhancock.wm) @Papaul this one did the thing about going to the wrong puppet server again. Can you delete it so i can try again later?  [8/10, retrying in 640.00s] Attem...
[00:07:34] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11063103 (10Jhancock.wm)
[00:08:36] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970
[00:08:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 (owner: 10TrainBranchBot)
[00:08:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11063104 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm thanks @elukey!
[00:14:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P80872 and previous config saved to /var/cache/conftool/dbconfig/20250806-001413-fceratto.json
[00:29:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T399728)', diff saved to https://phabricator.wikimedia.org/P80873 and previous config saved to /var/cache/conftool/dbconfig/20250806-002921-fceratto.json
[00:29:25] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[00:29:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[00:45:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 (owner: 10TrainBranchBot)
[00:47:49] <wikibugs>	 (03CR) 10Umherirrender: "Known failure: T400950" [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1175970 (owner: 10TrainBranchBot)
[00:49:00] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11063161 (10Papaul) @Jhancock.wm done
[01:24:08] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972
[01:24:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972 (owner: 10Andrew Bogott)
[01:25:09] <wikibugs>	 (03PS2) 10Andrew Bogott: profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972
[01:28:02] <wikibugs>	 (03CR) 10Andrew Bogott: [V:03+2 C:03+2] profile::wmcs::chartmuseum: install cm-push [puppet] - 10https://gerrit.wikimedia.org/r/1175972 (owner: 10Andrew Bogott)
[02:11:12] <wikibugs>	 (03CR) 10RLazarus: "Some high-level questions about this, after reading through UpdateConfigs.php and T398422. If you've already talked through all this with " [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx)
[02:24:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11063186 (10Jhancock.wm) 05Open→03Resolved @MatthewVernon we're finished with this test server if you want to run some test on it. It's a 1 CPU version of the config-J ser...
[02:30:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11063190 (10Jhancock.wm) 05Open→03Resolved @BTullis this is a 1CPU version of the config I servers you use for the an-worker and an-presto servers. It's in codfw so I'm not sure...
[03:02:46] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:05:20] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[03:09:14] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[03:09:30] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:12:46] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:14] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[03:22:35] <logmsgbot>	 jhancock@cumin1003 provision (PID 782454) is awaiting input
[03:43:21] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[03:43:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11063213 (10Jhancock.wm) 05Open→03Resolved @Marostegui got you a clean raid10. fyi, it is provisioned as uefi. thanks for your patience!
[05:09:44] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T0600)
[06:12:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621)
[06:12:14] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621)
[06:12:16] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1175991
[06:12:16] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621)
[06:12:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[06:14:44] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621)
[06:20:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621)
[06:20:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1175991
[06:20:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621)
[06:24:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6502/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[06:27:09] <wikibugs>	 (03PS1) 10KartikMistry: Enable the Contribute menu in 9th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122)
[06:28:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) (owner: 10KartikMistry)
[06:35:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[06:35:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T399728)', diff saved to https://phabricator.wikimedia.org/P80874 and previous config saved to /var/cache/conftool/dbconfig/20250806-063521-fceratto.json
[06:35:24] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[06:39:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T399728)', diff saved to https://phabricator.wikimedia.org/P80875 and previous config saved to /var/cache/conftool/dbconfig/20250806-063903-fceratto.json
[06:54:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P80876 and previous config saved to /var/cache/conftool/dbconfig/20250806-065410-fceratto.json
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T0700).
[07:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:19] <kart_>	 here
[07:00:28] <kart_>	 I'll deploy myself.
[07:02:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) (owner: 10KartikMistry)
[07:03:17] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the Contribute menu in 9th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176040 (https://phabricator.wikimedia.org/T397122) (owner: 10KartikMistry)
[07:03:52] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1176040|Enable the Contribute menu in 9th group of Wikipedias (T397122)]]
[07:03:56] <stashbot>	 T397122: Enable the Contribute menu in 9th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T397122
[07:05:54] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1176040|Enable the Contribute menu in 9th group of Wikipedias (T397122)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:16] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[07:09:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P80877 and previous config saved to /var/cache/conftool/dbconfig/20250806-070918-fceratto.json
[07:09:30] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:13:23] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: re-add email [puppet] - 10https://gerrit.wikimedia.org/r/1176095
[07:13:29] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176040|Enable the Contribute menu in 9th group of Wikipedias (T397122)]] (duration: 09m 37s)
[07:13:32] <stashbot>	 T397122: Enable the Contribute menu in 9th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T397122
[07:14:40] <kart_>	 I'm done. No more patches in the window AFAIK.
[07:23:19] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::elasticsearch: Set new extra_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/1176122 (https://phabricator.wikimedia.org/T401278)
[07:24:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T399728)', diff saved to https://phabricator.wikimedia.org/P80878 and previous config saved to /var/cache/conftool/dbconfig/20250806-072425-fceratto.json
[07:24:30] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:24:41] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[07:24:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T399728)', diff saved to https://phabricator.wikimedia.org/P80879 and previous config saved to /var/cache/conftool/dbconfig/20250806-072448-fceratto.json
[07:33:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T399728)', diff saved to https://phabricator.wikimedia.org/P80880 and previous config saved to /var/cache/conftool/dbconfig/20250806-073343-fceratto.json
[07:33:48] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:35:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855#11063541 (10brouberol) ` ~ ❯ ssh cirrussearch2055.codfw.wmnet Stdio forwarding request failed: Session open refused by peer Connection closed by UN...
[07:43:31] <wikibugs>	 (03PS1) 10Brouberol: datahub: increasae memory for frontend and mae-consumer pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599)
[07:45:22] <Reedy>	 !log created wikilove tables on thwiki T401279
[07:45:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:26] <stashbot>	 T401279: Extension WikiLove for th.wikipedia.org - https://phabricator.wikimedia.org/T401279
[07:46:52] <wikibugs>	 (03PS2) 10Brouberol: datahub: increase memory for frontend and mae-consumer pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599)
[07:48:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P80881 and previous config saved to /var/cache/conftool/dbconfig/20250806-074851-fceratto.json
[07:54:00] <wikibugs>	 (03PS5) 10Federico Ceratto: Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087)
[07:56:33] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1176122 (https://phabricator.wikimedia.org/T401278) (owner: 10Majavah)
[07:56:39] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::elasticsearch: Set new extra_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/1176122 (https://phabricator.wikimedia.org/T401278) (owner: 10Majavah)
[07:57:05] <wikibugs>	 (03PS1) 10Chlod Alejandro: thwiki: enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176189 (https://phabricator.wikimedia.org/T401279)
[07:59:16] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6503/console" [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[07:59:31] <wikibugs>	 (03CR) 10Reedy: [C:03+2] thwiki: enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176189 (https://phabricator.wikimedia.org/T401279) (owner: 10Chlod Alejandro)
[08:00:01] <hashar>	 `%{message}` ahh
[08:00:05] <jouncebot>	 hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T0800)
[08:00:05] <hashar>	 things never change :]
[08:00:21] <Reedy>	 hashar: hackathon says hello
[08:00:23] <hashar>	 that is the good old jsonTruncated messages
[08:00:25] <wikibugs>	 (03Merged) 10jenkins-bot: thwiki: enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176189 (https://phabricator.wikimedia.org/T401279) (owner: 10Chlod Alejandro)
[08:00:55] <hashar>	 Reedy: hi hackathon!  Please please wave your hands shouting "TRAIN IS ROLLING NOW!"
[08:00:56] <hashar>	 :)
[08:01:05] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1176189|thwiki: enable WikiLove (T401279)]]
[08:01:09] <stashbot>	 T401279: Extension WikiLove for th.wikipedia.org - https://phabricator.wikimedia.org/T401279
[08:01:13] <hashar>	 I wanna check that jsontruncated message though
[08:01:30] <hashar>	 ","message":"AbuseFilter parser error: ID: regexfailure; position: 148; params: ...
[08:01:37] * hashar files a task
[08:03:02] <logmsgbot>	 !log reedy@deploy1003 reedy, chlod: Backport for [[gerrit:1176189|thwiki: enable WikiLove (T401279)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:03:49] <logmsgbot>	 !log reedy@deploy1003 reedy, chlod: Continuing with sync
[08:03:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P80882 and previous config saved to /var/cache/conftool/dbconfig/20250806-080359-fceratto.json
[08:08:14] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+1] haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[08:08:51] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176189|thwiki: enable WikiLove (T401279)]] (duration: 07m 46s)
[08:08:54] <stashbot>	 T401279: Extension WikiLove for th.wikipedia.org - https://phabricator.wikimedia.org/T401279
[08:09:38] <hashar>	 https://phabricator.wikimedia.org/T401285
[08:09:46] <hashar>	 AbuseFilter parser error: ID: regexfailure; position: 148; params: /{{short description|American politician}}\\n{{Infobox officeholder \\n| name         = Anthony Frontzak\\n|image        = Pat Toomey, Official Portrait, 112th Congress.jpg\\n|....
[08:10:01] <wikibugs>	 (03PS1) 10Brouberol: airflow: add kafka-main-{eqiad,codfw}-external to the common connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176190 (https://phabricator.wikimedia.org/T372912)
[08:13:19] <wikibugs>	 (03CR) 10MVernon: "The Phab task talks about 3 hosts per DC, which would be 6, but you have only 5 here. Is that intentional?" [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto)
[08:13:23] <wikibugs>	 (03CR) 10DCausse: [C:03+1] airflow: add kafka-main-{eqiad,codfw}-external to the common connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176190 (https://phabricator.wikimedia.org/T372912) (owner: 10Brouberol)
[08:13:59] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: add kafka-main-{eqiad,codfw}-external to the common connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176190 (https://phabricator.wikimedia.org/T372912) (owner: 10Brouberol)
[08:15:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[08:16:27] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[08:17:52] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176191 (https://phabricator.wikimedia.org/T396374)
[08:17:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176191 (https://phabricator.wikimedia.org/T396374) (owner: 10TrainBranchBot)
[08:18:43] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176191 (https://phabricator.wikimedia.org/T396374) (owner: 10TrainBranchBot)
[08:19:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T399728)', diff saved to https://phabricator.wikimedia.org/P80883 and previous config saved to /var/cache/conftool/dbconfig/20250806-081906-fceratto.json
[08:19:10] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[08:19:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[08:19:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T399728)', diff saved to https://phabricator.wikimedia.org/P80884 and previous config saved to /var/cache/conftool/dbconfig/20250806-081929-fceratto.json
[08:20:04] <wikibugs>	 (03PS1) 10Chlod Alejandro: thwiki: add WT namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287)
[08:23:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T399728)', diff saved to https://phabricator.wikimedia.org/P80885 and previous config saved to /var/cache/conftool/dbconfig/20250806-082311-fceratto.json
[08:25:55] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.13  refs T396374
[08:25:59] <stashbot>	 T396374: 1.45.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T396374
[08:26:02] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "varnish upload tests are happy: `0 tests failed, 0 tests skipped, 19 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[08:27:51] <wikibugs>	 (03CR) 10Vgutierrez: Remove blocked-nets from varnish (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (owner: 10Giuseppe Lavagetto)
[08:33:02] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Alertmanager: add receiver and routing for experiment-platform tasks [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming)
[08:33:19] <wikibugs>	 (03CR) 10Clément Goubert: mw::maintenance: ExperimentationLab periodic job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx)
[08:34:10] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1016 to dse-k8s-worker1019
[08:36:12] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] Alertmanager: add receiver and routing for experiment-platform tasks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: 10Clare Ming)
[08:38:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P80886 and previous config saved to /var/cache/conftool/dbconfig/20250806-083818-fceratto.json
[08:39:18] <wikibugs>	 (03PS5) 10Clément Goubert: mw::maintenance: ExperimentationLab periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx)
[08:39:20] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx)
[08:39:26] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from snapshot1016 to dse-k8s-worker1019
[08:40:02] <wikibugs>	 (03CR) 10Federico Ceratto: "Replied to a question" [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto)
[08:42:58] <wikibugs>	 (03PS11) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127)
[08:43:10] <wikibugs>	 (03CR) 10Anzx: "minor changes to add task id to appropriate place" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: 10Chlod Alejandro)
[08:43:42] <wikibugs>	 (03CR) 10MVernon: [C:03+1] Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto)
[08:44:26] <wikibugs>	 (03CR) 10MVernon: "Hi @ltoscano@wikimedia.org is this a more helpful comment?" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon)
[08:45:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1016 to dse-k8s-worker1019
[08:48:15] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[08:49:44] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11063859 (10MatthewVernon)
[08:52:10] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1016 to dse-k8s-worker1019 - btullis@cumin1003"
[08:52:29] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1016 to dse-k8s-worker1019 - btullis@cumin1003"
[08:52:29] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:30] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1019 on all recursors
[08:52:33] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1019 on all recursors
[08:52:33] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1019
[08:53:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P80887 and previous config saved to /var/cache/conftool/dbconfig/20250806-085326-fceratto.json
[08:54:35] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1019
[08:55:14] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1016 to dse-k8s-worker1019
[08:56:52] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[08:59:55] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175888 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis)
[09:01:32] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175888 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis)
[09:07:24] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:07:42] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:08:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T399728)', diff saved to https://phabricator.wikimedia.org/P80888 and previous config saved to /var/cache/conftool/dbconfig/20250806-090833-fceratto.json
[09:08:37] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[09:08:49] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1195.eqiad.wmnet with reason: Maintenance
[09:08:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T399728)', diff saved to https://phabricator.wikimedia.org/P80889 and previous config saved to /var/cache/conftool/dbconfig/20250806-090856-fceratto.json
[09:12:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage
[09:12:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T399728)', diff saved to https://phabricator.wikimedia.org/P80890 and previous config saved to /var/cache/conftool/dbconfig/20250806-091235-fceratto.json
[09:13:25] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11063973 (10Joe)
[09:15:54] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm
[09:18:00] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[09:18:46] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage
[09:19:17] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm
[09:20:33] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm
[09:22:14] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Lift-Wing, 06Machine-Learning-Team, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11064019 (10gkyziridis) Hey @elukey thnx for sharing this issue. I have a question: Is this issue blocking the A/B testi...
[09:22:22] <wikibugs>	 (03CR) 10Jelto: [C:03+2] add more providers to fetch_external_clouds:vendors_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/1175781 (https://phabricator.wikimedia.org/T401003) (owner: 10Jelto)
[09:26:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:27:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P80891 and previous config saved to /var/cache/conftool/dbconfig/20250806-092743-fceratto.json
[09:29:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) (owner: Sergio Gimeno)
[09:31:09] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage
[09:33:03] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage
[09:34:15] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage
[09:34:16] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1018.eqiad.wmnet with reason: host reimage
[09:35:37] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[09:36:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:38:20] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1018.eqiad.wmnet with reason: host reimage
[09:38:33] <wikibugs>	 (CR) Jelto: [C:-1] "I'd prefer to do that in requestctl. It's already quite complex to troubleshoot why certain request got blocked, so having that in one pla" [puppet] - https://gerrit.wikimedia.org/r/1175933 (owner: Dzahn)
[09:38:38] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1019.eqiad.wmnet with reason: host reimage
[09:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:49] <wikibugs>	 (PS1) Hashar: ExperimentManager: Fix #getExperiment() when uninitialized [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176195 (https://phabricator.wikimedia.org/T401294)
[09:41:57] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage
[09:42:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176195 (https://phabricator.wikimedia.org/T401294) (owner: Hashar)
[09:42:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P80892 and previous config saved to /var/cache/conftool/dbconfig/20250806-094250-fceratto.json
[09:43:58] <wikibugs>	 (CR) Btullis: [C:+2] Remove last references to snapshot servers [puppet] - https://gerrit.wikimedia.org/r/1175903 (https://phabricator.wikimedia.org/T398438) (owner: Btullis)
[09:45:07] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1019.eqiad.wmnet with reason: host reimage
[09:45:30] <wikibugs>	 (Merged) jenkins-bot: ExperimentManager: Fix #getExperiment() when uninitialized [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176195 (https://phabricator.wikimedia.org/T401294) (owner: Hashar)
[09:45:55] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1176195|ExperimentManager: Fix #getExperiment() when uninitialized (T401294)]]
[09:45:58] <stashbot>	 T401294: PHP Warning: Undefined array key "active_experiments" - https://phabricator.wikimedia.org/T401294
[09:46:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:44] <logmsgbot>	 !log hashar@deploy1003 hashar: Backport for [[gerrit:1176195|ExperimentManager: Fix #getExperiment() when uninitialized (T401294)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:48:56] <logmsgbot>	 !log hashar@deploy1003 hashar: Continuing with sync
[09:50:39] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm
[09:54:15] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176195|ExperimentManager: Fix #getExperiment() when uninitialized (T401294)]] (duration: 08m 20s)
[09:54:18] <stashbot>	 T401294: PHP Warning: Undefined array key "active_experiments" - https://phabricator.wikimedia.org/T401294
[09:54:57] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm
[09:57:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T399728)', diff saved to https://phabricator.wikimedia.org/P80893 and previous config saved to /var/cache/conftool/dbconfig/20250806-095758-fceratto.json
[09:58:02] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[09:58:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[09:58:33] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:58:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T399728)', diff saved to https://phabricator.wikimedia.org/P80894 and previous config saved to /var/cache/conftool/dbconfig/20250806-095839-fceratto.json
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1000)
[10:00:23] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[10:02:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T399728)', diff saved to https://phabricator.wikimedia.org/P80895 and previous config saved to /var/cache/conftool/dbconfig/20250806-100220-fceratto.json
[10:03:53] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm
[10:17:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P80896 and previous config saved to /var/cache/conftool/dbconfig/20250806-101728-fceratto.json
[10:23:49] <wikibugs>	 (CR) Urbanecm: "Lifting my -2, as CommunityConfigurationExample now has the latest two deployment branches." [mediawiki-config] - https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: Urbanecm)
[10:25:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300 (OSleger-WMF) NEW
[10:26:40] <wikibugs>	 (CR) Btullis: [C:+2] Update flink-operator helm chart to match the upstream release v1.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) (owner: Btullis)
[10:32:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P80897 and previous config saved to /var/cache/conftool/dbconfig/20250806-103235-fceratto.json
[10:33:32] <wikibugs>	 (Merged) jenkins-bot: Update flink-operator helm chart to match the upstream release v1.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) (owner: Btullis)
[10:39:43] <wikibugs>	 (PS1) Jaime Nuche: releases-jenkins: add dpkg options to jenkins package installation [puppet] - https://gerrit.wikimedia.org/r/1176198 (https://phabricator.wikimedia.org/T400645)
[10:47:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T399728)', diff saved to https://phabricator.wikimedia.org/P80898 and previous config saved to /var/cache/conftool/dbconfig/20250806-104743-fceratto.json
[10:47:47] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[10:47:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[10:48:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T399728)', diff saved to https://phabricator.wikimedia.org/P80899 and previous config saved to /var/cache/conftool/dbconfig/20250806-104805-fceratto.json
[10:48:58] <wikibugs>	 (PS1) Gkyziridis: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266)
[10:50:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T399728)', diff saved to https://phabricator.wikimedia.org/P80900 and previous config saved to /var/cache/conftool/dbconfig/20250806-105047-fceratto.json
[10:50:50] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+1] "Thank you for the very swift help, LGTM <3" [deployment-charts] - https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: Gkyziridis)
[10:53:08] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: Gkyziridis)
[10:54:14] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: Gkyziridis)
[10:55:54] <wikibugs>	 (Merged) jenkins-bot: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1176199 (https://phabricator.wikimedia.org/T400266) (owner: Gkyziridis)
[10:58:30] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:58:45] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[11:00:05] <jouncebot>	 mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1100). nyaa~
[11:00:50] <logmsgbot>	 !log btullis@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:02:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:05:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P80901 and previous config saved to /var/cache/conftool/dbconfig/20250806-110555-fceratto.json
[11:06:07] <wikibugs>	 SRE, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to <wmde and nda>for <sadiyamohammed13> - https://phabricator.wikimedia.org/T401118#11064490 (WMDECyn) confirming this request from WMDE side
[11:09:30] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:09:41] <logmsgbot>	 !log btullis@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:14:03] <logmsgbot>	 !log btullis@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:15:18] <logmsgbot>	 !log btullis@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:21:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P80902 and previous config saved to /var/cache/conftool/dbconfig/20250806-112102-fceratto.json
[11:22:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:29:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:36:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T399728)', diff saved to https://phabricator.wikimedia.org/P80903 and previous config saved to /var/cache/conftool/dbconfig/20250806-113609-fceratto.json
[11:36:14] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[11:36:26] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[11:36:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80904 and previous config saved to /var/cache/conftool/dbconfig/20250806-113633-fceratto.json
[11:37:36] <wikibugs>	 (PS1) Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - https://gerrit.wikimedia.org/r/1176209
[11:37:54] <wikibugs>	 (PS2) Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - https://gerrit.wikimedia.org/r/1176209
[11:39:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:44:38] <logmsgbot>	 !log btullis@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: sync
[11:44:40] <logmsgbot>	 !log btullis@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: sync
[11:44:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:52:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80905 and previous config saved to /var/cache/conftool/dbconfig/20250806-115216-fceratto.json
[11:52:20] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[11:53:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:03:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:07:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P80906 and previous config saved to /var/cache/conftool/dbconfig/20250806-120723-fceratto.json
[12:11:48] <logmsgbot>	 !log btullis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:13:57] <logmsgbot>	 !log btullis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:16:10] <wikibugs>	 (CR) Elukey: "It is yes! I'd personally replace "$1 and $2 are the values captured in the two groups in parentheses in $jbod_re" with an example of befo" [puppet] - https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: MVernon)
[12:16:22] <wikibugs>	 (CR) Phuedx: [C:+2] xLab: Deploy v0.8.1 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1175961 (https://phabricator.wikimedia.org/T384107) (owner: Santiago Faci)
[12:17:14] <logmsgbot>	 !log btullis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:17:51] <wikibugs>	 (Merged) jenkins-bot: xLab: Deploy v0.8.1 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1175961 (https://phabricator.wikimedia.org/T384107) (owner: Santiago Faci)
[12:17:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:17:55] <wikibugs>	 (PS2) Reedy: thwiki: add WT namespace alias [mediawiki-config] - https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: Chlod Alejandro)
[12:18:07] <wikibugs>	 (CR) Reedy: [C:+2] thwiki: add WT namespace alias [mediawiki-config] - https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: Chlod Alejandro)
[12:18:22] <wikibugs>	 (CR) Reedy: [C:+2] "Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: Chlod Alejandro)
[12:18:36] <logmsgbot>	 !log btullis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:18:59] <wikibugs>	 (Merged) jenkins-bot: thwiki: add WT namespace alias [mediawiki-config] - https://gerrit.wikimedia.org/r/1176192 (https://phabricator.wikimedia.org/T401287) (owner: Chlod Alejandro)
[12:20:35] <wikibugs>	 SRE-SLO, EditCheck, Lift-Wing, Machine-Learning-Team, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11064788 (elukey) Hey @gkyziridis, nono this is something related to the SLO itself, we'll need to review the targets...
[12:22:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P80907 and previous config saved to /var/cache/conftool/dbconfig/20250806-122231-fceratto.json
[12:22:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:23:22] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1176192|thwiki: add WT namespace alias (T401287)]]
[12:23:25] <stashbot>	 T401287:   "WT" namespace alias for th.wikipedia.org - https://phabricator.wikimedia.org/T401287
[12:25:14] <logmsgbot>	 !log reedy@deploy1003 chlod, reedy: Backport for [[gerrit:1176192|thwiki: add WT namespace alias (T401287)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:25:55] <wikibugs>	 ops-codfw, DC-Ops: Add scs-e3-codfw to monitoring - https://phabricator.wikimedia.org/T401310 (ayounsi) NEW
[12:26:44] <logmsgbot>	 !log reedy@deploy1003 chlod, reedy: Continuing with sync
[12:26:59] <wikibugs>	 (PS1) Ayounsi: Rancid: add SR-Linux support [puppet] - https://gerrit.wikimedia.org/r/1176216
[12:27:01] <wikibugs>	 (PS12) MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127)
[12:27:25] <wikibugs>	 (CR) CI reject: [V:-1] Rancid: add SR-Linux support [puppet] - https://gerrit.wikimedia.org/r/1176216 (owner: Ayounsi)
[12:27:49] <wikibugs>	 ops-eqiad, DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11064837 (BTullis)
[12:28:09] <wikibugs>	 (CR) MVernon: "How about this? :)" [puppet] - https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: MVernon)
[12:29:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[12:31:39] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[12:32:18] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176192|thwiki: add WT namespace alias (T401287)]] (duration: 08m 56s)
[12:32:21] <stashbot>	 T401287:   "WT" namespace alias for th.wikipedia.org - https://phabricator.wikimedia.org/T401287
[12:32:33] <wikibugs>	 (PS2) Ayounsi: Rancid: add SR-Linux support [puppet] - https://gerrit.wikimedia.org/r/1176216
[12:33:12] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1176216 (owner: Ayounsi)
[12:33:24] <Reedy>	 !log run namespaceDupes.php on thwiki T401287
[12:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:03] <wikibugs>	 (PS1) Brouberol: Provision dse-k8s-worker1015 [puppet] - https://gerrit.wikimedia.org/r/1176218 (https://phabricator.wikimedia.org/T398438)
[12:34:05] <wikibugs>	 (PS1) Brouberol: Provision dse-k8s-worker1016 [puppet] - https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438)
[12:34:07] <wikibugs>	 (PS1) Brouberol: Provision dse-k8s-worker1017 [puppet] - https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438)
[12:34:09] <wikibugs>	 (PS1) Brouberol: Provision dse-k8s-worker1018 [puppet] - https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438)
[12:34:11] <wikibugs>	 (PS1) Brouberol: Provision dse-k8s-worker1019 [puppet] - https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438)
[12:35:00] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:35:17] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2245 to codfw - jhancock@cumin1003"
[12:35:21] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2245 to codfw - jhancock@cumin1003"
[12:35:21] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:35:32] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2245
[12:35:33] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2246
[12:35:34] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2247
[12:35:35] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2248
[12:35:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2245
[12:35:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2246
[12:35:45] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2247
[12:35:46] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2248
[12:36:15] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:36:41] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2246.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:37:11] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2247.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:37:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80908 and previous config saved to /var/cache/conftool/dbconfig/20250806-123738-fceratto.json
[12:37:42] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[12:37:44] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[12:37:47] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:37:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T399728)', diff saved to https://phabricator.wikimedia.org/P80909 and previous config saved to /var/cache/conftool/dbconfig/20250806-123751-fceratto.json
[12:39:59] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:41:36] <logmsgbot>	 !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[12:41:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T399728)', diff saved to https://phabricator.wikimedia.org/P80910 and previous config saved to /var/cache/conftool/dbconfig/20250806-124140-fceratto.json
[12:42:13] <logmsgbot>	 !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[12:43:56] <wikibugs>	 (PS1) Chlod Alejandro: Add maintenance script to recapitalize 'Nuke' tags [extensions/Nuke] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1176225 (https://phabricator.wikimedia.org/T381598)
[12:46:29] <wikibugs>	 (CR) Reedy: [C:+2] Add maintenance script to recapitalize 'Nuke' tags [extensions/Nuke] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1176225 (https://phabricator.wikimedia.org/T381598) (owner: Chlod Alejandro)
[12:49:41] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[12:49:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:49:55] <logmsgbot>	 jhancock@cumin1003 provision (PID 848282) is awaiting input
[12:51:27] <wikibugs>	 (CR) Elukey: [C:+1] swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: MVernon)
[12:51:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065008 (SLopes-WMF) As Otto's manager, I approve this request.
[12:53:17] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2246.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:53:30] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11065013 (elukey) To keep archives happy, late_command.sh fails. Reporting what I wrote on IRC to the Traffic team:  ` All right back testing late_command on...
[12:54:30] <wikibugs>	 (CR) MVernon: [C:+2] swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: MVernon)
[12:54:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:55:19] <wikibugs>	 (03Merged) 10jenkins-bot: Add maintenance script to recapitalize 'Nuke' tags [extensions/Nuke] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1176225 (https://phabricator.wikimedia.org/T381598) (owner: 10Chlod Alejandro)
[12:55:20] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2247.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:56:40] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye
[12:56:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P80911 and previous config saved to /var/cache/conftool/dbconfig/20250806-125648-fceratto.json
[12:56:52] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1176225|Add maintenance script to recapitalize 'Nuke' tags (T381598)]]
[12:56:53] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1091.eqiad.wmnet with OS bullseye
[12:56:55] <stashbot>	 T381598: Create and run a maintenance script to rename incorrectly capitalised Nuke-tagged log entries [2HRS] - https://phabricator.wikimedia.org/T381598
[12:57:09] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1015 [puppet] - 10https://gerrit.wikimedia.org/r/1176218 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[12:57:29] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[12:57:37] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye
[12:57:41] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1017 [puppet] - 10https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[12:57:56] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye
[12:58:00] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1018 [puppet] - 10https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[12:58:15] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:58:23] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Provision dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[12:58:35] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11065045 (10Jhancock.wm) a:05Marostegui→03Jhancock.wm
[12:58:44] <logmsgbot>	 !log reedy@deploy1003 chlod, reedy: Backport for [[gerrit:1176225|Add maintenance script to recapitalize 'Nuke' tags (T381598)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:59:17] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:59:48] <logmsgbot>	 !log reedy@deploy1003 chlod, reedy: Continuing with sync
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1300). Please do the needful.
[13:00:05] <jouncebot>	 sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:22] <Lucas_WMDE>	 o/
[13:00:42] <sergi0>	 o/
[13:00:49] <Lucas_WMDE>	 want to self-service your beta change? ^^
[13:00:55] <sergi0>	 Sure
[13:01:48] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11065069 (10Jhancock.wm) db2245 didn't pass provision. will investigate.
[13:02:42] <sergi0>	 Oh, it seems @Reedy has locked backporting, I'll come back in 10min
[13:03:36] <physikerwelt>	 Hi, I am seeing `14:55:59 npm warn tar TAR_ENTRY_ERROR ENOSPC: no space left on device, write` errors in Jenkins jobs. See for example https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/32039/console
[13:04:13] <Lucas_WMDE>	 physikerwelt: #wikimedia-releng is probably the better channel for that IIUC
[13:04:17] <taavi>	 yes
[13:04:22] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dbprov2007.codfw.wmnet with OS bookworm
[13:04:25] <Lucas_WMDE>	 also, the latest message by wmf-insecte in there suggests the disk space got freed up again
[13:04:33] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm
[13:04:40] <Lucas_WMDE>	 was 98%, now back to 42%
[13:04:58] <physikerwelt>	 Lucas_WMDE: sorry, thank you.
[13:05:11] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176225|Add maintenance script to recapitalize 'Nuke' tags (T381598)]] (duration: 08m 18s)
[13:05:14] <stashbot>	 T381598: Create and run a maintenance script to rename incorrectly capitalised Nuke-tagged log entries [2HRS] - https://phabricator.wikimedia.org/T381598
[13:05:50] <brouberol>	 !log committing new homer config to add dse-k8s-worker101[5-9] to the bgp groups
[13:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:31] <Lucas_WMDE>	 Reedy: can sergi0 deploy or do you need something else backported? (I assume running the maint script shouldn’t conflict with another deployment)
[13:08:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Provision dse-k8s-worker1015 [puppet] - 10https://gerrit.wikimedia.org/r/1176218 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:08:58] <wikibugs>	 (03PS1) 10Genoveva Galarza: wikifunctions: Upgrade orchestrator from 2025-07-29-155618 to 2025-08-01-154925 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176230 (https://phabricator.wikimedia.org/T351458)
[13:08:59] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage
[13:09:17] <sergi0>	 scap says it is unlocked now so I'm going ahead
[13:10:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) (owner: 10Sergio Gimeno)
[13:11:11] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] beta: enable new leveling up notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) (owner: 10Sergio Gimeno)
[13:11:37] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[13:11:44] <Reedy>	 !log ran `foreachwiki extensions/Nuke/maintenance/normalizeNukeTags.php` T381598
[13:11:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:47] <stashbot>	 T381598: Create and run a maintenance script to rename incorrectly capitalised Nuke-tagged log entries [2HRS] - https://phabricator.wikimedia.org/T381598
[13:11:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P80912 and previous config saved to /var/cache/conftool/dbconfig/20250806-131155-fceratto.json
[13:12:23] <sergi0>	 done
[13:14:19] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage
[13:14:23] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:54] <wikibugs>	 (03CR) 10Dr0ptp4kt: "Thanks @rlazarus@wikimedia.org! Best if @phuedx@wikimedia.org chimes in (he's tech lead on the Experimentation Lab ("xLab" for short - I m" [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx)
[13:18:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling both afterwards
[13:18:31] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[13:19:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[13:20:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[13:21:33] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:21:38] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2007.codfw.wmnet with reason: host reimage
[13:21:40] <wikibugs>	 (03PS2) 10Brouberol: Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438)
[13:22:34] <wikibugs>	 (03PS1) 10Genoveva Galarza: wikifunctions: Upgrade evaluators from 2025-07-30-130544 to 2025-08-05-075031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794)
[13:25:09] <wikibugs>	 (03PS3) 10Brouberol: Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438)
[13:25:09] <wikibugs>	 (03PS2) 10Brouberol: Provision dse-k8s-worker1017 [puppet] - 10https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438)
[13:25:09] <wikibugs>	 (03PS2) 10Brouberol: Provision dse-k8s-worker1018 [puppet] - 10https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438)
[13:25:09] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2007.codfw.wmnet with reason: host reimage
[13:25:10] <wikibugs>	 (03PS2) 10Brouberol: Provision dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438)
[13:25:23] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319 (10fnegri) 03NEW
[13:26:01] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11065222 (10fnegri)
[13:27:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T399728)', diff saved to https://phabricator.wikimedia.org/P80913 and previous config saved to /var/cache/conftool/dbconfig/20250806-132703-fceratto.json
[13:27:07] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[13:27:18] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[13:27:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T399728)', diff saved to https://phabricator.wikimedia.org/P80914 and previous config saved to /var/cache/conftool/dbconfig/20250806-132725-fceratto.json
[13:27:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Provision dse-k8s-worker1016 [puppet] - 10https://gerrit.wikimedia.org/r/1176219 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:27:37] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Provision dse-k8s-worker1017 [puppet] - 10https://gerrit.wikimedia.org/r/1176220 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:27:40] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Provision dse-k8s-worker1018 [puppet] - 10https://gerrit.wikimedia.org/r/1176221 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:27:47] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Provision dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176222 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:29:26] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1091.eqiad.wmnet with OS bullseye
[13:29:47] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065253 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1091.eqiad.wmnet with OS bullseye completed: - ms-be1...
[13:31:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T399728)', diff saved to https://phabricator.wikimedia.org/P80915 and previous config saved to /var/cache/conftool/dbconfig/20250806-133115-fceratto.json
[13:36:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2088.codfw.wmnet with OS bullseye
[13:36:14] <wikibugs>	 06SRE, 10SRE-swift-storage: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye completed: - ms-be2088 (**PASS**)   - Dow...
[13:37:04] <wikibugs>	 (03PS1) 10Brouberol: site: assign dse_k8s::worker role to dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176243 (https://phabricator.wikimedia.org/T398438)
[13:39:18] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] site: assign dse_k8s::worker role to dse-k8s-worker1019 [puppet] - 10https://gerrit.wikimedia.org/r/1176243 (https://phabricator.wikimedia.org/T398438) (owner: 10Brouberol)
[13:40:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065306 (10ayounsi)
[13:46:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P80916 and previous config saved to /var/cache/conftool/dbconfig/20250806-134623-fceratto.json
[13:47:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065311 (10ayounsi) @ssastry hello, we also need your approval to add @OSleger-WMF to `parsoid-admin`  (cf. https://gerrit.wikimedia.org/r/plugins/gitiles/operation...
[13:47:11] <wikibugs>	 (03PS1) 10Jelto: gitlab: adjust nftables throttling thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971)
[13:47:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065313 (10ayounsi)
[13:48:01] <wikibugs>	 (03PS2) 10Jelto: gitlab: adjust nftables throttling thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971)
[13:49:12] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[13:49:16] <wikibugs>	 (03PS1) 10Cwhite: prometheus: make extra_config field optional [puppet] - 10https://gerrit.wikimedia.org/r/1176247
[13:50:16] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[13:50:17] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2007.codfw.wmnet with OS bookworm
[13:50:28] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm completed: - dbprov2007 (**WARN*...
[13:51:34] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6506/co" [puppet] - 10https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) (owner: 10Jelto)
[13:53:55] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065323 (10Jhancock.wm)
[13:54:17] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11065324 (10Jhancock.wm) 05Open→03Resolved @jcrespo this is complete
[13:59:14] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065367 (10ayounsi)
[13:59:43] <wikibugs>	 (03PS3) 10Hashar: build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1175475
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1400)
[14:00:08] <wikibugs>	 (03PS1) 10Effie Mouzeli: profile::hcaptcha::proxy: config improvements [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841)
[14:01:08] <wikibugs>	 (03CR) 10Hashar: "I have made some code adjustment after `QUnit.test.each` learned to output nice labels when being fed an array ( https://github.com/qunitj" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1175475 (owner: 10Hashar)
[14:01:26] <wikibugs>	 (03PS2) 10Effie Mouzeli: profile::hcaptcha::proxy: config improvements [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841)
[14:01:30] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-07-29-155618 to 2025-08-01-154925 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176230 (https://phabricator.wikimedia.org/T351458) (owner: 10Genoveva Galarza)
[14:01:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P80917 and previous config saved to /var/cache/conftool/dbconfig/20250806-140130-fceratto.json
[14:01:32] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[14:03:21] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-07-29-155618 to 2025-08-01-154925 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176230 (https://phabricator.wikimedia.org/T351458) (owner: 10Genoveva Galarza)
[14:05:54] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:06:41] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:07:14] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:07:44] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:07:54] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:08:23] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:08:52] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 2025-07-30-130544 to 2025-08-05-075031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794) (owner: 10Genoveva Galarza)
[14:11:04] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-07-30-130544 to 2025-08-05-075031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794) (owner: 10Genoveva Galarza)
[14:12:25] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:13:09] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:13:27] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:14:06] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:14:14] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:14:22] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:14:44] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:14:54] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:14:57] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:15:50] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:16:25] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:16:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T399728)', diff saved to https://phabricator.wikimedia.org/P80918 and previous config saved to /var/cache/conftool/dbconfig/20250806-141638-fceratto.json
[14:16:42] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[14:16:54] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[14:17:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T399728)', diff saved to https://phabricator.wikimedia.org/P80919 and previous config saved to /var/cache/conftool/dbconfig/20250806-141701-fceratto.json
[14:17:22] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:18:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling both afterwards
[14:18:06] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[14:18:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[14:19:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1011.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[14:20:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T399728)', diff saved to https://phabricator.wikimedia.org/P80920 and previous config saved to /var/cache/conftool/dbconfig/20250806-142046-fceratto.json
[14:23:27] <wikibugs>	 (03PS1) 10Zabe: Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1176250 (https://phabricator.wikimedia.org/T397348)
[14:23:40] <wikibugs>	 (03PS1) 10Zabe: Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1176251 (https://phabricator.wikimedia.org/T397348)
[14:23:56] <icinga-wm>	 PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2703 MB (3% inode=89%): /tmp 2703 MB (3% inode=89%): /var/tmp 2703 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[14:25:25] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] php8.1: rebuild to pick up 8.1.33-1+wmf11u2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: 10Scott French)
[14:29:02] <wikibugs>	 (03PS1) 10MVernon: swift: remove old nodes, drain & reweight SM C-J nodes [puppet] - 10https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354)
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1400)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1430)
[14:31:15] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[14:33:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[14:34:13] <wikibugs>	 (03CR) 10Zabe: multiversion: Move remaining dblist helper to WmfConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle)
[14:35:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P80921 and previous config saved to /var/cache/conftool/dbconfig/20250806-143554-fceratto.json
[14:36:31] <wikibugs>	 (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[14:37:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:39:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11065568 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Demonstrating how this works, you can see that the two systems with these controllers in hav...
[14:40:19] <wikibugs>	 (03PS1) 10Sergio Gimeno: [Growth] Remove get-started notification variant delays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176254
[14:40:59] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: remove old nodes, drain & reweight SM C-J nodes [puppet] - 10https://gerrit.wikimedia.org/r/1176253 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[14:41:41] <wikibugs>	 (03CR) 10Krinkle: multiversion: Move remaining dblist helper to WmfConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle)
[14:47:26] <wikibugs>	 (03CR) 10Zabe: multiversion: Move remaining dblist helper to WmfConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle)
[14:47:49] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11065593 (10MatthewVernon)
[14:47:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:48:02] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[14:50:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling both afterwards
[14:50:13] <wikibugs>	 (03CR) 10Tchanders: [C:04-1] "Looks like we can solve this by just moving the assignment of the edit right to the 'temp' group (i.e. removing the block linked to above)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[14:50:13] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[14:50:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[14:50:50] <wikibugs>	 (03CR) 10Krinkle: multiversion: Move remaining dblist helper to WmfConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle)
[14:51:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P80922 and previous config saved to /var/cache/conftool/dbconfig/20250806-145101-fceratto.json
[14:57:27] <wikibugs>	 (PS1) MVernon: swift: remove ms-be106[1-3] [puppet] - https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354)
[15:06:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T399728)', diff saved to https://phabricator.wikimedia.org/P80923 and previous config saved to /var/cache/conftool/dbconfig/20250806-150609-fceratto.json
[15:06:13] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[15:06:24] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[15:06:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T399728)', diff saved to https://phabricator.wikimedia.org/P80924 and previous config saved to /var/cache/conftool/dbconfig/20250806-150631-fceratto.json
[15:07:28] <wikibugs>	 (PS1) Brouberol: Update the image tag associated with PG 15 [puppet] - https://gerrit.wikimedia.org/r/1176261 (https://phabricator.wikimedia.org/T396037)
[15:09:30] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T399728)', diff saved to https://phabricator.wikimedia.org/P80925 and previous config saved to /var/cache/conftool/dbconfig/20250806-151017-fceratto.json
[15:17:58] <wikibugs>	 SRE-OnFire, WMDE-TechWish-Maintenance, Sustainability (Incident Followup): Split out reusable Parsoid+Cite analysis module from scraper - https://phabricator.wikimedia.org/T401334 (awight) NEW
[15:19:25] <wikibugs>	 (PS1) Andrew Bogott: Magnum: require 'helm3' rather than 'helm' [puppet] - https://gerrit.wikimedia.org/r/1176262
[15:19:45] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:20:05] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Magnum: require 'helm3' rather than 'helm' [puppet] - https://gerrit.wikimedia.org/r/1176262 (owner: Andrew Bogott)
[15:20:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:22:35] <wikibugs>	 SRE-OnFire, Cite (Sub-referencing), Sustainability (Incident Followup): Spike: define operational monitoring requirements for Cite error alerting - https://phabricator.wikimedia.org/T401335 (awight) NEW
[15:25:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P80926 and previous config saved to /var/cache/conftool/dbconfig/20250806-152524-fceratto.json
[15:26:15] <wikibugs>	 (CR) Clément Goubert: [C:-1] kube-state-metrics: collect metrics for metadata.labels.username (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1176209 (owner: Effie Mouzeli)
[15:26:27] <wikibugs>	 SRE-OnFire, Cite, VisualEditor, Patch-For-Review, and 4 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11065763 (awight)
[15:26:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065765 (ssastry) I approve.
[15:27:20] <wikibugs>	 SRE-OnFire, Cite, Cite (Sub-referencing), Sustainability (Incident Followup), WMDE-TechWish-Sprint-Cherry-Chocolate-Ice-Cream-2025-07-23: Tech debt: review uses of references list item id during Parsoid html2wt - https://phabricator.wikimedia.org/T400803#11065766 (awight)
[15:29:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065767 (ABreault-WMF) I think he also needs `parsoid-test-roots` https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules...
[15:31:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11065770 (ssastry) The broader request is to add Otto to all groups that the other members of content-transform-team are part of. Thanks!
[15:37:02] <wikibugs>	 (PS3) Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - https://gerrit.wikimedia.org/r/1176209
[15:37:12] <wikibugs>	 (CR) Stevemunene: [C:+1] datahub: increase memory for frontend and mae-consumer pods [deployment-charts] - https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599) (owner: Brouberol)
[15:37:26] <wikibugs>	 (CR) Brouberol: [C:+2] datahub: increase memory for frontend and mae-consumer pods [deployment-charts] - https://gerrit.wikimedia.org/r/1176187 (https://phabricator.wikimedia.org/T398599) (owner: Brouberol)
[15:39:30] <wikibugs>	 SRE, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to <wmde and nda>for <sadiyamohammed13> - https://phabricator.wikimedia.org/T401118#11065825 (KFrancis) Hi all, confirming receipt of this request.  Please confirm Halima Sadiya Mohammed is the user's full name and please p...
[15:40:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P80927 and previous config saved to /var/cache/conftool/dbconfig/20250806-154032-fceratto.json
[15:42:54] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.197.0" for 169 host(s)
[15:44:06] <wikibugs>	 (CR) Brouberol: [C:+1] Add collation to the list of sqooped table [puppet] - https://gerrit.wikimedia.org/r/1175924 (https://phabricator.wikimedia.org/T397923) (owner: Aleksandar Mastilovic)
[15:46:04] <wikibugs>	 (CR) MVernon: "If you could eyeball this today-your-working-day, please, I can deploy tomorrow-my-working-day and get the hosts decommissioned. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: MVernon)
[15:46:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling both afterwards
[15:46:49] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[15:48:05] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.197.0" completed for 169 hosts
[15:48:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[15:49:57] <wikibugs>	 (CR) Dzahn: [C:+1] gitlab: adjust nftables throttling thresholds [puppet] - https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) (owner: Jelto)
[15:52:25] <wikibugs>	 (CR) Hashar: [C:+1] gerrit: add daemons ssh host key to known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: Hashar)
[15:55:20] <wikibugs>	 (PS1) Santiago Faci: xLab: Deploy v0.8.1 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316)
[15:55:37] <wikibugs>	 (PS2) Santiago Faci: xLab: Deploy v0.8.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316)
[15:55:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T399728)', diff saved to https://phabricator.wikimedia.org/P80928 and previous config saved to /var/cache/conftool/dbconfig/20250806-155540-fceratto.json
[15:55:44] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[15:55:56] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[15:57:25] <wikibugs>	 (PS1) Santiago Faci: xLab: Deploy v0.8.2 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1176267 (https://phabricator.wikimedia.org/T401316)
[15:57:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[15:59:33] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1251.eqiad.wmnet with reason: Maintenance
[15:59:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T399728)', diff saved to https://phabricator.wikimedia.org/P80929 and previous config saved to /var/cache/conftool/dbconfig/20250806-155939-fceratto.json
[16:00:41] <wikibugs>	 (CR) Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1176209 (owner: Effie Mouzeli)
[16:00:44] <wikibugs>	 (CR) Clare Ming: [C:+2] xLab: Deploy v0.8.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316) (owner: Santiago Faci)
[16:00:57] <wikibugs>	 (CR) Clare Ming: [C:+2] xLab: Deploy v0.8.2 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1176267 (https://phabricator.wikimedia.org/T401316) (owner: Santiago Faci)
[16:02:44] <wikibugs>	 (Merged) jenkins-bot: xLab: Deploy v0.8.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1176266 (https://phabricator.wikimedia.org/T401316) (owner: Santiago Faci)
[16:03:02] <wikibugs>	 (Merged) jenkins-bot: xLab: Deploy v0.8.2 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1176267 (https://phabricator.wikimedia.org/T401316) (owner: Santiago Faci)
[16:03:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T399728)', diff saved to https://phabricator.wikimedia.org/P80930 and previous config saved to /var/cache/conftool/dbconfig/20250806-160323-fceratto.json
[16:03:28] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[16:04:26] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:16] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:06:30] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:07:06] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54369 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:07:16] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:07:58] <wikibugs>	 (CR) Clément Goubert: [C:+1] kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - https://gerrit.wikimedia.org/r/1176209 (owner: Effie Mouzeli)
[16:13:49] <wikibugs>	 (CR) Dzahn: [C:+2] "thank you, makes sense, checked man page" [puppet] - https://gerrit.wikimedia.org/r/1176198 (https://phabricator.wikimedia.org/T400645) (owner: Jaime Nuche)
[16:16:39] <wikibugs>	 (Abandoned) Dzahn: phabricator: block some scrapers and bots at apache level [puppet] - https://gerrit.wikimedia.org/r/1175933 (owner: Dzahn)
[16:18:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P80931 and previous config saved to /var/cache/conftool/dbconfig/20250806-161831-fceratto.json
[16:20:32] <icinga-wm>	 PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 155062 MB (4% inode=99%): /var/lib/hadoop/data/h 155405 MB (4% inode=99%): /var/lib/hadoop/data/b 167701 MB (4% inode=99%): /var/lib/hadoop/data/k 145243 MB (3% inode=99%): /var/lib/hadoop/data/m 153546 MB (4% inode=99%): /var/lib/hadoop/data/f 158147 MB (4% inode=99%): /var/lib/hadoop/data/j 157955 MB (4% inode=99%): /var/lib/hadoop/data
[16:20:32] <icinga-wm>	 2 MB (4% inode=99%): /var/lib/hadoop/data/l 164977 MB (4% inode=99%): /var/lib/hadoop/data/i 151408 MB (4% inode=99%): /var/lib/hadoop/data/g 152199 MB (4% inode=99%): /var/lib/hadoop/data/c 155277 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[16:24:32] <wikibugs>	 (CR) Cwhite: [C:+1] "LGTM - LMK if you need help rolling this out." [puppet] - https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: Clare Ming)
[16:24:40] <wikibugs>	 (CR) Scott French: profile::hcaptcha::proxy: config improvements (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: Effie Mouzeli)
[16:30:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:32:37] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[16:33:12] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[16:33:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P80934 and previous config saved to /var/cache/conftool/dbconfig/20250806-163338-fceratto.json
[16:34:59] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[16:37:51] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[16:37:51] <wikibugs>	 (CR) Federico Ceratto: "I see the 3 hosts already drained in modules/swift/files/eqiad-prod_hosts.yaml and they match the regex ms-be106[1-3] as described" [puppet] - https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: MVernon)
[16:38:25] <wikibugs>	 (CR) RLazarus: [C:+2] Alertmanager: add receiver and routing for experiment-platform tasks [puppet] - https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: Clare Ming)
[16:38:30] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "LGTM, see previous comment" [puppet] - https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: MVernon)
[16:45:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:48:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T399728)', diff saved to https://phabricator.wikimedia.org/P80935 and previous config saved to /var/cache/conftool/dbconfig/20250806-164846-fceratto.json
[16:48:51] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[16:49:02] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[16:49:31] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:52:51] <Daimona>	 What have we got here?
[16:53:16] <Reedy>	 gerrit restart
[16:53:44] <Daimona>	 gotcha, ty
[16:54:07] <mutante>	 trying to revive it
[16:57:12] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:57:12] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:57:12] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:57:24] <sukhe>	 oh wow
[16:57:34] <sukhe>	 ah gerrit :)
[16:57:43] <mutante>	 I think I just got it back
[16:57:45] <mutante>	 wfm now
[16:57:53] <sukhe>	 thanks, forcing recheck
[16:58:20] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:58:20] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:58:21] <sukhe>	 I should port this over to alert manager
[16:58:22] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:59:08] <mutante>	 Reedy: ok now, right?
[16:59:31] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:00:04] <jouncebot>	 swfrench-wmf: Your horoscope predicts another MediaWiki infrastructure (UTC late) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1700).
[17:00:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:00:36] <swfrench-wmf>	 o/
[17:00:45] <swfrench-wmf>	 I'll be getting started here in a bit
[17:01:11] <tgr>	 o/
[17:01:46] <swfrench-wmf>	 thanks for sticking around, tgr :)
[17:01:54] <swfrench-wmf>	 I'll keep you posted on when things are ready to test
[17:02:16] <wikibugs>	 (CR) Clare Ming: "thanks @cwhite@wikimedia.org! i added this patch to the bonus puppet window that @rlazarus@wikimedia.org set up for us -- hope that works " [puppet] - https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: Clare Ming)
[17:03:19] <wikibugs>	 (CR) RLazarus: [C:+2] "No worries, it's all deployed! I figured this one didn't need any coordination so I just took care of it." [puppet] - https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) (owner: Clare Ming)
[17:03:54] <swfrench-wmf>	 !log reprepro include php8.1_8.1.33-1+wmf11u2 in component/php81 - T383047
[17:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:58] <stashbot>	 T383047: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047
[17:05:33] <wikibugs>	 (CR) RLazarus: [C:+1] kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - https://gerrit.wikimedia.org/r/1176209 (owner: Effie Mouzeli)
[17:06:18] <wikibugs>	 (CR) Scott French: [V:+2] "Build locally with docker-pkg." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: Scott French)
[17:06:27] <wikibugs>	 (CR) Scott French: [V:+2] "Thanks for the review, Effie!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: Scott French)
[17:06:44] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] php8.1: rebuild to pick up 8.1.33-1+wmf11u2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) (owner: Scott French)
[17:10:40] <swfrench-wmf>	 !log built and published php8.1 production image stack at 8.1.33-1-s3 - T383047
[17:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:43] <stashbot>	 T383047: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047
[17:11:21] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up new 8.1.33-1-s3 production images - T383047
[17:12:22] <swfrench-wmf>	 tgr: since this requires a full image rebuild, it'll probably be 15-20m until the new image is live in mw-debug
[17:15:05] <logmsgbot>	 !log amastilovic@deploy1003 Started deploy [analytics/refinery@2178dda] (hadoop-test): Updates to sqoop TEST [analytics/refinery@2178dda8]
[17:15:08] <wikibugs>	 (PS1) Dzahn: gerrit: add an IP to abusers list [puppet] - https://gerrit.wikimedia.org/r/1176288
[17:15:59] <logmsgbot>	 !log amastilovic@deploy1003 Finished deploy [analytics/refinery@2178dda] (hadoop-test): Updates to sqoop TEST [analytics/refinery@2178dda8] (duration: 00m 53s)
[17:16:54] <logmsgbot>	 !log amastilovic@deploy1003 Started deploy [analytics/refinery@2178dda]: Updates to sqoop [analytics/refinery@2178dda8]
[17:17:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling both afterwards
[17:17:28] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[17:19:08] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: add an IP to abusers list [puppet] - https://gerrit.wikimedia.org/r/1176288 (owner: Dzahn)
[17:19:23] <logmsgbot>	 !log amastilovic@deploy1003 Finished deploy [analytics/refinery@2178dda]: Updates to sqoop [analytics/refinery@2178dda8] (duration: 02m 29s)
[17:19:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1013.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[17:19:43] <logmsgbot>	 !log amastilovic@deploy1003 Started deploy [analytics/refinery@2178dda] (thin): Updates to sqoop THIN [analytics/refinery@2178dda8]
[17:20:51] <logmsgbot>	 !log amastilovic@deploy1003 Finished deploy [analytics/refinery@2178dda] (thin): Updates to sqoop THIN [analytics/refinery@2178dda8] (duration: 01m 08s)
[17:30:38] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:32:52] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Deployment to pick up new 8.1.33-1-s3 production images - T383047 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:32:55] <stashbot>	 T383047: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047
[17:33:29] <swfrench-wmf>	 tgr: we're live in mw-debug if there's anything you'd like to check there
[17:33:56] <tgr>	 thanks, checking
[17:34:33] <swfrench-wmf>	 just successfully Special:EmailUser'd myself, so at least I've not borked anything horribly, heh
[17:38:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:40:17] <bd808>	 mutante: is the gerrit probe failure expected?
[17:42:37] <mutante>	 bd808: no. but we just blocked some abuse.. 
[17:42:56] <tgr>	 swfrench-wmf: I went through the common workflows involving email, they all work
[17:43:09] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:43:12] <swfrench-wmf>	 tgr: amazing, thank you very much!
[17:43:42] <swfrench-wmf>	 I'll continue, and we can see how things improve w.r.t. error handling over the next couple of hours
[17:43:54] <mutante>	 bd808: should recover in a sec but WIP
[17:44:31] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:44:36] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Continuing with sync
[17:47:52] <bd808>	 mutante: ack. thanks for chasing the ghosts that keep messing with us
[17:48:09] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:52:17] <logmsgbot>	 jhancock@cumin1003 provision (PID 882499) is awaiting input
[17:53:09] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:54:31] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:56:05] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up new 8.1.33-1-s3 production images - T383047 (duration: 45m 10s)
[17:56:08] <stashbot>	 T383047: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047
[17:56:45] <swfrench-wmf>	 alright, I should be done with (what little remains of) the infra window
[17:58:09] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:58:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:00:05] <jouncebot>	 hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T1800)
[18:03:54] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2245.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:06:56] <wikibugs>	 (PS1) CDanis: benthos: webrequest_sampled_live: remove client_port [puppet] - https://gerrit.wikimedia.org/r/1176295 (https://phabricator.wikimedia.org/T398236)
[18:06:58] <wikibugs>	 (PS1) CDanis: turnilo: webrequest_sampled_live: remove client_port [puppet] - https://gerrit.wikimedia.org/r/1176296 (https://phabricator.wikimedia.org/T398236)
[18:11:42] <wikibugs>	 (PS1) Dzahn: gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - https://gerrit.wikimedia.org/r/1176297
[18:11:57] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - https://gerrit.wikimedia.org/r/1176297 (owner: Dzahn)
[18:12:10] <wikibugs>	 (PS2) Dzahn: gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - https://gerrit.wikimedia.org/r/1176297
[18:13:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling both afterwards
[18:13:52] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[18:15:52] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: block abuse from Alibaba Cloud / aliyun [puppet] - 10https://gerrit.wikimedia.org/r/1176297 (owner: 10Dzahn)
[18:17:00] <swfrench-wmf>	 brennen: I see that the train rolled during the earlier window. would it be alright if I sneak in some infra-related changes (enabling PHP 8.3 image builds) during this window?
[18:17:44] <brennen>	 swfrench-wmf: yep, nothing train-related for this window.  go for it.
[18:17:50] <swfrench-wmf>	 amazing
[18:17:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1013.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[18:18:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:23:03] <swfrench-wmf>	 FYI, I won't be taking any action until closer to 19:00 UTC
[18:23:09] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:23:22] <mutante>	 swfrench-wmf: currently fighting gerrit problems
[18:23:41] <swfrench-wmf>	 ack, thanks mutante!
[18:23:58] <swfrench-wmf>	 (that's also part of why I'm holding)
[18:24:02] <mutante>	 great
[18:24:11] <wikibugs>	 (03PS1) 10CDanis: benthos webrequest: Add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753)
[18:24:14] <wikibugs>	 (03PS1) 10CDanis: turnilo: webrequest: add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176300 (https://phabricator.wikimedia.org/T400753)
[18:24:31] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:29:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] benthos webrequest: Add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis)
[18:29:31] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:33:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:26] <wikibugs>	 (03CR) 10CDanis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis)
[18:38:00] <mutante>	 swfrench-wmf: for now it seems better
[18:38:36] <wikibugs>	 (03PS1) 10CDanis: [WIP] haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302
[18:42:44] <wikibugs>	 (03PS1) 10Dzahn: Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303
[18:42:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 (owner: 10Dzahn)
[18:43:41] <wikibugs>	 06SRE, 10SRE-SLO, 06Traffic: Page on ATS backend errors relative to traffic - https://phabricator.wikimedia.org/T400675#11066418 (10RLazarus) We talked about this in the SLO meeting today -- one possible approach is to keep `ATSBackendErrorsHigh` as a default policy, but keep a list of services to //exclude/...
[18:47:53] <swfrench-wmf>	 mutante: awesome, thank you!
[18:51:42] <wikibugs>	 (03PS2) 10Dzahn: Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303
[18:52:36] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2245']
[18:52:50] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2245']
[18:53:17] <wikibugs>	 (03PS3) 10Dzahn: Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303
[18:54:44] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2245.codfw.wmnet with OS bookworm
[18:54:51] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11066430 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2245.codfw.wmnet with OS bookworm
[18:54:51] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "gerrit: add an IP to abusers list" [puppet] - 10https://gerrit.wikimedia.org/r/1176303 (owner: 10Dzahn)
[18:55:30] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2246.codfw.wmnet with OS bookworm
[18:55:38] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11066431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2246.codfw.wmnet with OS bookworm
[18:55:51] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2247.codfw.wmnet with OS bookworm
[18:56:00] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11066432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2247.codfw.wmnet with OS bookworm
[19:02:38] <swfrench-wmf>	 alright, I'll be getting started on those infra-related changes shortly
[19:08:47] <logmsgbot>	 jhancock@cumin1003 reimage (PID 890906) is awaiting input
[19:09:31] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:10:23] <logmsgbot>	 jhancock@cumin1003 reimage (PID 890975) is awaiting input
[19:11:32] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: No-op deployment to configure PHP 8.3 image builds - T399884
[19:11:35] <stashbot>	 T399884: Configure production MediaWiki image builds for PHP 8.3 - https://phabricator.wikimedia.org/T399884
[19:11:39] <logmsgbot>	 jhancock@cumin1003 reimage (PID 890989) is awaiting input
[19:30:19] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: No-op deployment to configure PHP 8.3 image builds - T399884 (duration: 19m 22s)
[19:30:23] <stashbot>	 T399884: Configure production MediaWiki image builds for PHP 8.3 - https://phabricator.wikimedia.org/T399884
[19:31:27] <swfrench-wmf>	 alright, I'm done with my changes
[19:40:32] <icinga-wm>	 PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 160519 MB (4% inode=99%): /var/lib/hadoop/data/h 155197 MB (4% inode=99%): /var/lib/hadoop/data/b 159570 MB (4% inode=99%): /var/lib/hadoop/data/k 149609 MB (3% inode=99%): /var/lib/hadoop/data/m 156022 MB (4% inode=99%): /var/lib/hadoop/data/f 161558 MB (4% inode=99%): /var/lib/hadoop/data/j 158060 MB (4% inode=99%): /var/lib/hadoop/data
[19:40:32] <icinga-wm>	 5 MB (4% inode=99%): /var/lib/hadoop/data/l 154186 MB (4% inode=99%): /var/lib/hadoop/data/i 157536 MB (4% inode=99%): /var/lib/hadoop/data/g 156551 MB (4% inode=99%): /var/lib/hadoop/data/c 156477 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[19:41:08] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 12143 MB (4% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[19:43:14] <icinga-wm>	 PROBLEM - Host msw1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[19:43:38] <icinga-wm>	 RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[19:54:06] <papaul>	 !log maintenance going on on msw1-eqiad
[19:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:29] <papaul>	 ok doing it 
[19:55:51] <papaul>	 ok doing it 
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:05:34] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:07:20] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2245.codfw.wmnet with OS bookworm
[20:08:59] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2246.codfw.wmnet with OS bookworm
[20:09:57] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2247.codfw.wmnet with OS bookworm
[20:10:25] <wikibugs>	 (03CR) 10BPirkle: [C:03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz)
[20:25:49] <addshore>	 Are the ES servers for trace.wikimedia.org queryable manually at all?
[20:38:32] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.197.1" for 169 host(s)
[20:42:43] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.197.1" for 1 host(s)
[20:43:35] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.197.1" completed for 1 hosts
[20:53:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling both afterwards
[20:53:55] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[20:59:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1014.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[20:59:27] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T2100)
[21:00:32] <icinga-wm>	 PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 151429 MB (4% inode=99%): /var/lib/hadoop/data/h 151208 MB (4% inode=99%): /var/lib/hadoop/data/b 152901 MB (4% inode=99%): /var/lib/hadoop/data/k 145183 MB (3% inode=99%): /var/lib/hadoop/data/m 146753 MB (3% inode=99%): /var/lib/hadoop/data/f 153605 MB (4% inode=99%): /var/lib/hadoop/data/j 150407 MB (4% inode=99%): /var/lib/hadoop/data
[21:00:32] <icinga-wm>	 3 MB (4% inode=99%): /var/lib/hadoop/data/l 149050 MB (3% inode=99%): /var/lib/hadoop/data/i 154517 MB (4% inode=99%): /var/lib/hadoop/data/g 145566 MB (3% inode=99%): /var/lib/hadoop/data/c 148083 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[21:05:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:26] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart
[21:14:39] <logmsgbot>	 !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97)
[21:15:14] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart
[21:20:32] <icinga-wm>	 PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 127122 MB (3% inode=99%): /var/lib/hadoop/data/g 134715 MB (3% inode=99%): /var/lib/hadoop/data/j 134425 MB (3% inode=99%): /var/lib/hadoop/data/c 126480 MB (3% inode=99%): /var/lib/hadoop/data/b 133953 MB (3% inode=99%): /var/lib/hadoop/data/l 136852 MB (3% inode=99%): /var/lib/hadoop/data/k 121830 MB (3% inode=99%): /var/lib/hadoop/data
[21:20:32] <icinga-wm>	 4 MB (3% inode=99%): /var/lib/hadoop/data/i 135656 MB (3% inode=99%): /var/lib/hadoop/data/m 135499 MB (3% inode=99%): /var/lib/hadoop/data/d 130019 MB (3% inode=99%): /var/lib/hadoop/data/h 135665 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[21:34:24] <wikibugs>	 (03CR) 10Eevans: [C:03+1] swift: remove ms-be106[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[21:50:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling both afterwards
[21:51:00] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[21:53:10] <wikibugs>	 (03PS28) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[21:57:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1014.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[21:57:58] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[21:59:27] <wikibugs>	 (03CR) 10CDobbins: dnsrecursor: add recursor.yml.erb (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250806T2200)
[22:09:55] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[22:46:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[22:49:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T401210#11066919 (10VRiley-WMF) 05Open→03Resolved Reseated cable and it seems to have come back online.
[22:51:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[22:57:20] <wikibugs>	 (03PS4) 10Cwhite: prometheus::elasticsearch_exporter: make extra_config field optional [puppet] - 10https://gerrit.wikimedia.org/r/1176247 (https://phabricator.wikimedia.org/T401278)
[22:57:20] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1176247/6508/" [puppet] - 10https://gerrit.wikimedia.org/r/1176247 (https://phabricator.wikimedia.org/T401278) (owner: 10Cwhite)
[23:09:31] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:12:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[23:17:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[23:19:17] <wikibugs>	 (03PS1) 10Dzahn: admin: add an alias to my own .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/1176322
[23:38:14] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176323
[23:38:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176323 (owner: 10TrainBranchBot)
[23:42:47] <wikibugs>	 (03PS1) 10Dzahn: jenkins: escape : with \ in sudoers privileges line [puppet] - 10https://gerrit.wikimedia.org/r/1176324 (https://phabricator.wikimedia.org/T400645)
[23:43:05] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1176324" [puppet] - 10https://gerrit.wikimedia.org/r/1176198 (https://phabricator.wikimedia.org/T400645) (owner: 10Jaime Nuche)
[23:43:42] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] jenkins: escape : with \ in sudoers privileges line [puppet] - 10https://gerrit.wikimedia.org/r/1176324 (https://phabricator.wikimedia.org/T400645) (owner: 10Dzahn)
[23:55:09] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176323 (owner: 10TrainBranchBot)