[00:08:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164702
[00:08:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164702 (owner: TrainBranchBot)
[00:13:31] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:30:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164702 (owner: TrainBranchBot)
[00:46:19] <wikibugs>	 (PS1) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1164703
[00:46:28] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[01:12:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:22:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:28:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:42:42] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:27] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:45:42] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:13:31] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:58:55] <wikibugs>	 (CR) Kosta Harlan: [C:+1] "Removing the -2 per T397940#10956874" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[05:59:30] <wikibugs>	 (PS8) Tchanders: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940)
[06:00:04] <wikibugs>	 (PS9) Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[06:01:29] <wikibugs>	 (PS10) Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[06:04:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[06:04:36] <kostajh>	 jouncebot: nowandnext
[06:04:36] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250629T0700)
[06:04:36] <jouncebot>	 In 0 hour(s) and 55 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T0700)
[06:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:16:50] <wikibugs>	 (PS1) Slyngshede: data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707
[06:18:11] <wikibugs>	 (CR) Muehlenhoff: data.yaml offboarding trokhymovych (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[06:18:25] <wikibugs>	 (CR) CI reject: [V:-1] data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[06:20:04] <wikibugs>	 (PS2) Slyngshede: data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707
[06:21:20] <wikibugs>	 (CR) Muehlenhoff: [C:-1] "Pending extension, see the mail thread with Diego" [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[06:26:34] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti role to ganeti2049 [puppet] - https://gerrit.wikimedia.org/r/1164708 (https://phabricator.wikimedia.org/T396590)
[06:29:24] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add ganeti role to ganeti2049 [puppet] - https://gerrit.wikimedia.org/r/1164708 (https://phabricator.wikimedia.org/T396590) (owner: Muehlenhoff)
[06:31:37] <wikibugs>	 (PS1) Slyngshede: data.yaml offboarding mnz [puppet] - https://gerrit.wikimedia.org/r/1164709
[06:32:35] <XioNoX>	 !log push pfw policies - T397875
[06:32:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164709 (owner: Slyngshede)
[06:35:36] <wikibugs>	 (CR) Slyngshede: [C:+2] data.yaml offboarding mnz [puppet] - https://gerrit.wikimedia.org/r/1164709 (owner: Slyngshede)
[06:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet
[06:38:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:02] <logmsgbot>	 !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muniza out of all services on: 2400 hosts
[06:40:22] <wikibugs>	 (PS3) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[06:40:58] <wikibugs>	 SRE, Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#10957224 (Joe) >>! In T396562#10907844, @JAllemandou wrote:  > I think this would be feasible as most of frontend data is already available in th...
[06:45:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet
[06:47:10] <wikibugs>	 (PS1) Volans: JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696)
[06:47:11] <wikibugs>	 (PS1) Volans: JS/CSS: update DataTables library [software/debmonitor] - https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696)
[06:51:21] <wikibugs>	 (PS1) Muehlenhoff: Update records for dhardy [puppet] - https://gerrit.wikimedia.org/r/1164861
[06:56:11] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: split pushgateway logs [puppet] - https://gerrit.wikimedia.org/r/1164862 (https://phabricator.wikimedia.org/T398091)
[06:56:37] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2049.codfw.wmnet to cluster codfw and group B
[06:58:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2049.codfw.wmnet to cluster codfw and group B
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T0700).
[07:00:04] <jouncebot>	 koi, DreamRimmer, and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:28] <koi>	 o/
[07:04:25] <DreamRimmer>	 o/
[07:04:31] <kostajh>	 hi, I'm here 
[07:04:57] <DreamRimmer>	 for deployment?
[07:05:15] <kostajh>	 it seems like no other deployers are around, so, yes, I could do that 
[07:05:25] <kostajh>	 I'll have a look at the patches 
[07:08:01] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] prometheus: split pushgateway logs [puppet] - https://gerrit.wikimedia.org/r/1164862 (https://phabricator.wikimedia.org/T398091) (owner: Filippo Giunchedi)
[07:08:26] <DreamRimmer>	 I do not have anything to test these patches, but I hope you can proceed based on the +1 from other devs. I am backporting these at the request of the original uploader, as they are not currently available
[07:08:45] <DreamRimmer>	 Look good to me too
[07:10:26] <koi>	 my patch is simple enough and i can test it quickly
[07:12:32] <kostajh>	 I'm not totally sure about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164506, and would prefer that the author or reviewer are around to verify those changes 
[07:13:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163081 (https://phabricator.wikimedia.org/T397676) (owner: Stang)
[07:13:35] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] prometheus: split pushgateway logs [puppet] - https://gerrit.wikimedia.org/r/1164862 (https://phabricator.wikimedia.org/T398091) (owner: Filippo Giunchedi)
[07:14:26] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Remove autopatrol from patroller group [mediawiki-config] - https://gerrit.wikimedia.org/r/1163081 (https://phabricator.wikimedia.org/T397676) (owner: Stang)
[07:14:41] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957279 (MoritzMuehlenhoff)
[07:14:54] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163081|zhwiki: Remove autopatrol from patroller group (T397676)]]
[07:15:00] <stashbot>	 T397676: Remove autopatrol from patroller group on zhwiki - https://phabricator.wikimedia.org/T397676
[07:15:20] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[07:15:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957283 (ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[07:15:50] <NovemLinguae>	 kostajh: patch author here. what's up?
[07:16:10] <kostajh>	 hi NovemLinguae 
[07:16:40] <kostajh>	 NovemLinguae: would you be able to verify https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164506 after deployment? Is there any risk of breakage when updating this config? 
[07:17:59] <NovemLinguae>	 the patch that introduces that config var, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1090512, hasn't ridden the train yet, was merged on friday. might make it hard to test
[07:18:49] <NovemLinguae>	 if we want thorough testing, i guess we could backport that one too? or delay this backport a couple days?
[07:19:42] <wikibugs>	 (CR) Kosta Harlan: initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[07:20:25] <kostajh>	 NovemLinguae: what changes would we expect to see once the train reaches group2, and your config patch is enabled? 
[07:22:02] <godog>	 !log bounce prometheus-pushgateway on prometheus1005 - T398091
[07:22:04] <NovemLinguae>	 SecurePoll local elections would start writing JSON to subpages of the page MediaWiki:SecurePoll, as a type of logging when certain poll settings pages are edited (Special:SecurePoll/create, Special:SecurePoll/edit, Special:SecurePoll/translate, Special:SecurePoll/votereligibility). This is identical to the behavior of an existing setting, with the only difference being the target page it writes to
[07:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:08] <stashbot>	 T398091: Prometheus1005 out of disk on / - https://phabricator.wikimedia.org/T398091
[07:22:57] <NovemLinguae>	 MediaWiki:SecurePoll and its subpages are configured as a read only namespace. The idea of doing this kind of logging is to provide a useful history tab to see who modified what settings when.
[07:23:32] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM, email is not active yet, but will be soon." [puppet] - https://gerrit.wikimedia.org/r/1164861 (owner: Muehlenhoff)
[07:25:25] <NovemLinguae>	 Re your code review comment, wmgSecurePollUseNamespace defaults to false. Can add enwiki = false if you want though. Up to you.
[07:25:34] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Update records for dhardy [puppet] - https://gerrit.wikimedia.org/r/1164861 (owner: Muehlenhoff)
[07:26:28] <kostajh>	 NovemLinguae: ack. So for now, nothing would change on enwiki today - it would just impact new polls that are created in the future? 
[07:27:10] <NovemLinguae>	 Correct. Starting on Thursday when the new setting rides the train, then CREATING or EDITING a LOCAL poll will start writing stuff to MediaWiki:SecurePoll/* subpages.
[07:27:24] <NovemLinguae>	 (unless we backport that patch right now)
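The exchange above describes a per-wiki flag that has no visible effect until the extension code that reads it is deployed. A minimal Python sketch of that reasoning follows; it is an illustration only, not the actual mediawiki-config mechanism (which is PHP). The table layout, the `False` default, and the function names `resolve_setting` and `writes_log_subpages` are all assumptions made for this example; only the `enwiki = true` value comes from the discussion.

```python
from typing import Mapping

# Hypothetical per-wiki settings table. Only the enwiki=True entry is taken
# from the discussion; the False default is an assumption for illustration.
SETTINGS: Mapping[str, Mapping[str, bool]] = {
    "wgSecurePollUseMediaWikiNamespace": {"default": False, "enwiki": True},
}


def resolve_setting(name: str, wiki: str) -> bool:
    """Return the wiki-specific value if present, else the default."""
    per_wiki = SETTINGS[name]
    return per_wiki.get(wiki, per_wiki["default"])


def writes_log_subpages(wiki: str, extension_reads_setting: bool) -> bool:
    """The flag only matters once the deployed extension code reads it."""
    flag = resolve_setting("wgSecurePollUseMediaWikiNamespace", wiki)
    return extension_reads_setting and flag
```

Under these assumptions, `writes_log_subpages("enwiki", False)` models the state right after the config backport (flag set, extension change not yet on the train), and `writes_log_subpages("enwiki", True)` models the state once the extension change is deployed.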
[07:28:47] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[07:30:06] <kostajh>	 ok
[07:30:09] <kostajh>	 seems fine, then 
[07:30:32] <NovemLinguae>	 :)
[07:33:16] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[07:33:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:17] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[07:35:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957308 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to drbd
[07:37:39] <kostajh>	 still waiting on images to get built for the first patch 
[07:37:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:39:34] <kostajh>	 NovemLinguae: so, to confirm, there's nothing for you to verify when I sync your patches, right? 
[07:39:48] <kostajh>	 other than that enwiki still loads :) 
[07:40:06] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: disable nftables prometheus exporter script in wmcs [puppet] - https://gerrit.wikimedia.org/r/1164389 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[07:41:41] <godog>	 !log restart prometheus-pushgateway on prometheus1005 with fresh state - T398091
[07:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:48] <stashbot>	 T398091: Prometheus1005 out of disk on / - https://phabricator.wikimedia.org/T398091
[07:42:03] <NovemLinguae>	 correct. I don't think I can test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164507 either since it doesn't affect enwiki. someone with the proper election administrator user group on those wikis could go edit a test poll, then check their contribs to make sure an edit was made
[07:42:16] <NovemLinguae>	 (testwiki, votewiki)
[07:42:49] <NovemLinguae>	 if a diff link is given i'd be happy to analyze it and confirm that it's working
[07:43:16] <Tran>	 I might still have rights if we want to check? Let me confirm
[07:43:51] <Tran>	 Yes I'm still an election admin on votewiki if we want to QA the change.
[07:44:09] <kostajh>	 Tran: cool, thanks. Will let you know when the changes are staged 
[07:44:13] <logmsgbot>	 !log kharlan@deploy1003 stang, kharlan: Backport for [[gerrit:1163081|zhwiki: Remove autopatrol from patroller group (T397676)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:44:20] <stashbot>	 T397676: Remove autopatrol from patroller group on zhwiki - https://phabricator.wikimedia.org/T397676
[07:44:24] <kostajh>	 koi: your change is available for testing now 
[07:44:31] <koi>	 lookin
[07:45:01] <koi>	 kostajh: LGTM
[07:45:08] <logmsgbot>	 !log kharlan@deploy1003 stang, kharlan: Continuing with sync
[07:45:10] <kostajh>	 koi: thanks
[07:45:42] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:48:30] <moritzm>	 !log installing krb5 security updates
[07:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:15] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org
[07:52:24] <wikibugs>	 (PS10) Vgutierrez: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[07:52:28] <wikibugs>	 SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10957327 (Aklapper)
[07:54:03] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[07:56:23] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:56:42] <vgutierrez>	 uh... expected?
[07:57:01] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org
[07:57:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:57:30] <wikibugs>	 (PS1) Jelto: gitlab: fix typo in hiera config [puppet] - https://gerrit.wikimedia.org/r/1164944 (https://phabricator.wikimedia.org/T396622)
[07:58:20] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163081|zhwiki: Remove autopatrol from patroller group (T397676)]] (duration: 43m 26s)
[07:58:26] <stashbot>	 T397676: Remove autopatrol from patroller group on zhwiki - https://phabricator.wikimedia.org/T397676
[07:59:23] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[08:00:02] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6104/console" [puppet] - https://gerrit.wikimedia.org/r/1164944 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[08:00:15] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms
[08:00:16] <moritzm>	 vgutierrez: yeah, I had to switch the disk type away from local disk storage, since the underlying Ganeti node is being decommed
[08:00:26] <moritzm>	 and that needs a reboot to effect the change
[08:00:27] <vgutierrez>	 ack, thx
[08:01:02] <kostajh>	 ok, on to the securepoll patches
[08:01:08] <kostajh>	 that ook a long time!
[08:01:13] <kostajh>	 *took
[08:01:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[08:01:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164507 (owner: Novem Linguae)
[08:01:50] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: fix typo in hiera config [puppet] - https://gerrit.wikimedia.org/r/1164944 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[08:02:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:02:28] <wikibugs>	 (Merged) jenkins-bot: initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[08:02:31] <wikibugs>	 (Merged) jenkins-bot: refactor unnecessary wmgSecurePollUseNamespace variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1164507 (owner: Novem Linguae)
[08:02:47] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1164506|initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (T398080)]], [[gerrit:1164507|refactor unnecessary wmgSecurePollUseNamespace variable]]
[08:02:53] <stashbot>	 T398080: Set $wgSecurePollUseMediaWikiNamespace = true on English Wikipedia - https://phabricator.wikimedia.org/T398080
[08:04:29] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[08:04:46] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957357 (ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[08:07:03] <logmsgbot>	 !log kharlan@deploy1003 novemlinguae, kharlan: Backport for [[gerrit:1164506|initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (T398080)]], [[gerrit:1164507|refactor unnecessary wmgSecurePollUseNamespace variable]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:07:35] <NovemLinguae>	 for testing, can visit Special:SecurePoll and click around and make sure no obvious PHP errors, etc.
[08:07:49] <kostajh>	 Tran: ok, ready for review now 
[08:07:54] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[08:08:18] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[08:08:46] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957365 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to plain
[08:08:55] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[08:08:57] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[08:10:00] <NovemLinguae>	 I clicked around Special:SecurePoll and Special:SecurePoll/translate, no errors. (Not that I'd expect any for setting an undeployed config option, but just to be sure.)
[08:10:16] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[08:10:30] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957366 (ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[08:10:56] <Tran>	 Triggered a few intentional errors on the create page, changed voter eligibility, ran a tally - no errors, saw the voter eligibility change logged
[08:11:01] <kostajh>	 ok, cool
[08:11:03] <kostajh>	 thanks!
[08:11:32] <logmsgbot>	 !log kharlan@deploy1003 novemlinguae, kharlan: Continuing with sync
[08:12:29] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[08:13:24] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[08:13:31] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:13:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:41] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10957372 (MoritzMuehlenhoff)
[08:13:50] <_joe_>	 elukey: is the docker-reporter thing you?
[08:14:26] <kostajh>	 Does anyone know if it's possible to remove someone else's -2 on a patch? (this patch in particular https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1163738)
[08:14:49] <elukey>	 _joe_ no I think Moritz was working on it, but it must be another occurrence of the v2 vs v1 api, need to add it to the exclude filter
[08:16:28] <wikibugs>	 (PS1) Majavah: openstack: puppet-enc: Return helpful error for invalid role data [puppet] - https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117)
[08:16:54] <moritzm>	 _joe_:  we're looking into it, seems introduced by https://phabricator.wikimedia.org/T395826
[08:17:12] <_joe_>	 ^_^
[08:17:13] <taavi>	 kostajh: hover over the code-review item over "Submit Requirements" in the left sidebar, and there is a trash can icon to delete a vote
[08:17:32] <_joe_>	 well *WMCS* doesn't really matter here, so we need to exclude that cluster I guess?
[08:19:02] <kostajh>	 taavi: thanks, but I am not seeing "Submit requirements"
[08:19:05] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164506|initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (T398080)]], [[gerrit:1164507|refactor unnecessary wmgSecurePollUseNamespace variable]] (duration: 16m 17s)
[08:19:10] <stashbot>	 T398080: Set $wgSecurePollUseMediaWikiNamespace = true on English Wikipedia - https://phabricator.wikimedia.org/T398080
[08:19:59] <NovemLinguae>	 kostajh: i see it. screenshot: https://imgur.com/a/WQqD07j
[08:20:05] <wikibugs>	 (PS1) Elukey: docker-report: fix k8s filter [puppet] - https://gerrit.wikimedia.org/r/1164946
[08:20:28] <taavi>	 kostajh: hmm, this is what the left sidebar looks like for me: https://prod-misc-upload.public.object.majava.org/taavi/0JySJ1nI1Stut.png do you not see that bottom bit?
[08:20:28] <kostajh>	 I think it's because in this case, the -2 is from the owner 
[08:20:29] <wikibugs>	 (CR) Vgutierrez: [C:+2] cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[08:20:42] <kostajh>	 aha
[08:20:43] <kostajh>	 thanks
[08:21:01] <elukey>	 _joe_ we have a proposal to make debmonitor check only images running on k8s clusters, it should make things easier. It doesn't make sense to track everything like we do now.
[08:21:23] <wikibugs>	 (PS1) Volans: kubernetes: skip missing/failing images on update [software/debmonitor] - https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696)
[08:22:06] <wikibugs>	 (CR) Ayounsi: [C:+1] "nice ! LGTM." [puppet] - https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[08:22:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[08:22:22] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164946 (owner: Elukey)
[08:22:49] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: AOkoth)
[08:23:08] <wikibugs>	 (Merged) jenkins-bot: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[08:23:13] <wikibugs>	 (CR) Elukey: [C:+2] docker-report: fix k8s filter [puppet] - https://gerrit.wikimedia.org/r/1164946 (owner: Elukey)
[08:23:21] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163738|temp accounts: Enable temp account creation on further wikis (T397940)]]
[08:23:27] <stashbot>	 T397940: Batch 3 deployment of Temp Accounts Major pilots - https://phabricator.wikimedia.org/T397940
[08:24:13] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "You can also update the doc once it's merged https://wikitech.wikimedia.org/wiki/Anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[08:25:11] <NovemLinguae>	 kostajh: thanks for doing backports this morning. it's nice to see folks helping in the UTC morning backport window, which is sometimes a bit quiet
[08:25:12] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "with one nit" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[08:25:16] <logmsgbot>	 !log kharlan@deploy1003 kharlan, tchanders: Backport for [[gerrit:1163738|temp accounts: Enable temp account creation on further wikis (T397940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:25:17] <NovemLinguae>	 appreciate it!
[08:25:35] <wikibugs>	 (03CR) 10Elukey: [C:03+1] JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:25:46] <kostajh>	 NovemLinguae: sure thing, thanks for the discussion about your patches, and for your work on SecurePoll!
[08:25:59] <NovemLinguae>	 yw :)
[08:26:33] <wikibugs>	 (03PS1) 10KartikMistry: Remove cxstats campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705)
[08:26:36] <wikibugs>	 (03CR) 10Elukey: [C:03+1] JS/CSS: update DataTables library [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:26:58] <logmsgbot>	 !log kharlan@deploy1003 kharlan, tchanders: Continuing with sync
[08:29:28] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Very nice I like it!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:30:49] <_joe_>	 elukey: I agree
[08:31:31] <elukey>	 _joe_ we have something almost ready after the hackathon, more info to follow soon :)
[08:31:36] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth)
[08:32:09] <_joe_>	 elukey: I think there's some value for the releng images too, which are run constantly in CI, but that's not as important
[08:32:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:32:23] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163738|temp accounts: Enable temp account creation on further wikis (T397940)]] (duration: 09m 01s)
[08:32:28] <stashbot>	 T397940: Batch 3 deployment of Temp Accounts Major pilots - https://phabricator.wikimedia.org/T397940
[08:33:01] <elukey>	 _joe_ true true, I'd like to understand the use case though, because I have the sense that nobody really pays attention to those :D
[08:33:28] <wikibugs>	 (03CR) 10Volans: [C:03+2] JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:33:34] <_joe_>	 elukey: I agree, but that's because we don't have a closed loop on maintenance
[08:33:50] <wikibugs>	 (03CR) 10Volans: [C:03+2] JS/CSS: update DataTables library [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:34:13] <wikibugs>	 (03CR) 10Volans: [C:03+2] kubernetes: skip missing/failing images on update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:34:24] <wikibugs>	 (03Merged) 10jenkins-bot: JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:34:40] <wikibugs>	 (03Merged) 10jenkins-bot: JS/CSS: update DataTables library [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:35:04] <wikibugs>	 (03Merged) 10jenkins-bot: kubernetes: skip missing/failing images on update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:35:27] <elukey>	 _joe_: we could also have separate instances, one for prod and one for CI
[08:35:49] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] Remove cxstats campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) (owner: 10KartikMistry)
[08:37:06] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "cache,haproxy: refactor haproxy captures to fix x-analytics logging" [puppet] - 10https://gerrit.wikimedia.org/r/1164952
[08:37:31] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "cache,haproxy: refactor haproxy captures to fix x-analytics logging" [puppet] - 10https://gerrit.wikimedia.org/r/1164952 (https://phabricator.wikimedia.org/T397917)
[08:37:39] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.6.3 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164953
[08:37:40] <wikibugs>	 (03CR) 10David Caro: "LGTM, just a nit that does not need to be fixed in this patch, but would be good to keep in mind" [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah)
[08:37:45] <wikibugs>	 (03CR) 10David Caro: [C:03+1] openstack: puppet-enc: Return helpful error for invalid role data [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah)
[08:37:57] <wikibugs>	 (03CR) 10Volans: [C:03+2] Upstream release v0.6.3 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164953 (owner: 10Volans)
[08:38:22] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: puppet-enc: Return helpful error for invalid role data [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah)
[08:38:31] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: puppet-enc: Return helpful error for invalid role data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah)
[08:39:05] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v0.6.3 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164953 (owner: 10Volans)
[08:39:55] <kostajh>	 !log UTC morning deploys done
[08:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:00] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert "cache,haproxy: refactor haproxy captures to fix x-analytics logging" [puppet] - 10https://gerrit.wikimedia.org/r/1164952 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez)
[08:42:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:45:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:47:51] <volans>	 !log uploaded debmonitor-server,python3-debmonitor_0.6.3 to apt.wikimedia.org bookworm-wikimedia
[08:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:54] <elukey>	 \o/
[08:50:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:50:57] <XioNoX>	 !log test routed ganeti compatible bird on ganeti2034/testvm2006 - T362392
[08:51:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:03] <stashbot>	 T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392
[08:53:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:22] <wikibugs>	 (03PS1) 10Volans: CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696)
[08:59:19] <wikibugs>	 (03CR) 10Elukey: [C:03+1] CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[09:00:50] <wikibugs>	 (03CR) 10Volans: [C:03+2] CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[09:01:41] <wikibugs>	 (03Merged) 10jenkins-bot: CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[09:02:53] <wikibugs>	 (03PS1) 10Tiziano Fogli: LibericaEtcdErrors: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164957 (https://phabricator.wikimedia.org/T396320)
[09:04:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[09:04:29] <wikibugs>	 (03PS1) 10Aklapper: Push Due Date value higher [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164958
[09:04:30] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10957604 (10ops-monitoring-bot) Draining ganeti5004.eqsin.wmnet of running VMs
[09:04:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[09:05:20] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Push Due Date value higher [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164958 (owner: 10Aklapper)
[09:06:05] <wikibugs>	 (03PS1) 10Tiziano Fogli: PyBalBGPUnstable: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164959 (https://phabricator.wikimedia.org/T396321)
[09:12:26] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:exim4::smarthost: Migrate to ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164483 (owner: 10Majavah)
[09:12:39] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:exim4::smarthost: Migrate to ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164483 (owner: 10Majavah)
[09:12:57] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah)
[09:14:50] <logmsgbot>	 !log klausman@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: Shutting down for Ganeti migration
[09:18:51] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[09:19:06] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957626 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[09:19:53] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[09:20:17] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[09:20:37] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957628 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[09:20:50] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet
[09:23:07] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet
[09:23:39] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.remove-downtime for ml-staging-ctrl2001.codfw.wmnet
[09:23:40] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-staging-ctrl2001.codfw.wmnet
[09:26:48] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan)
[09:26:59] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961
[09:28:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:53] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan)
[09:31:08] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961
[09:38:33] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[09:39:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[09:40:21] <wikibugs>	 (03PS1) 10Muehlenhoff: docker-report: filter all zuul images in oci filter [puppet] - 10https://gerrit.wikimedia.org/r/1164963
[09:42:16] <wikibugs>	 (03CR) 10Elukey: [C:03+1] docker-report: filter all zuul images in oci filter [puppet] - 10https://gerrit.wikimedia.org/r/1164963 (owner: 10Muehlenhoff)
[09:42:19] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[09:42:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[09:43:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[09:44:01] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10957739 (10ops-monitoring-bot) Draining ganeti5004.eqsin.wmnet of running VMs
[09:45:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] docker-report: filter all zuul images in oci filter [puppet] - 10https://gerrit.wikimedia.org/r/1164963 (owner: 10Muehlenhoff)
[09:51:10] <wikibugs>	 (03PS1) 10Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970)
[09:51:13] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[09:51:53] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[09:53:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:59] <moritzm>	 !log installing nginx security updates
[09:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:19] <wikibugs>	 (03PS1) 10Jelto: gitlab: pass ensure flag to auto_restarts::service [puppet] - 10https://gerrit.wikimedia.org/r/1164965 (https://phabricator.wikimedia.org/T396622)
[09:56:08] <wikibugs>	 (03CR) 10Majavah: [C:03+2] toolforge: wmcs-package-build: Fix Aptly host name [puppet] - 10https://gerrit.wikimedia.org/r/1153586 (owner: 10Majavah)
[09:56:15] <wikibugs>	 (03CR) 10Majavah: [C:03+2] toolforge: wmcs-package-build: Remove unneeded escape [puppet] - 10https://gerrit.wikimedia.org/r/1153587 (https://phabricator.wikimedia.org/T396004) (owner: 10Majavah)
[09:56:27] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge: aptly: Install rsync for backups [puppet] - 10https://gerrit.wikimedia.org/r/1153588 (owner: 10Majavah)
[09:57:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet
[09:57:43] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6105/console" [puppet] - 10https://gerrit.wikimedia.org/r/1164965 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto)
[10:00:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1000)
[10:00:44] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: pass ensure flag to auto_restarts::service [puppet] - 10https://gerrit.wikimedia.org/r/1164965 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto)
[10:01:09] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:01:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet
[10:01:56] <wikibugs>	 (03PS1) 10Btullis: Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164)
[10:05:20] <Emperor>	 !log depool codfw ms-swift for container DB repairs T383053
[10:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:26] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[10:05:34] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw
[10:06:05] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 9.965 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:06:58] <Emperor>	 !log repair wikipedia-commons-local-thumb.6e on ms-be2059 ms-be2058 ms-be2076 T383053
[10:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:18] <moritzm>	 log installing openssl security updates
[10:09:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:11:09] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.657 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:11:19] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1003.eqiad.wmnet with reason: Maintenance and reboot
[10:12:11] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:14:09] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:14:22] <wikibugs>	 (03PS1) 10Michael Große: Growth(enwiki): enable limiting Add a Link to new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034)
[10:17:10] <Emperor>	 !log repair wikipedia-commons-local-thumb.99 on ms-be2064 T383053
[10:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:16] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[10:18:03] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 8.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:18:09] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.559 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:18:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:00] <wikibugs>	 (03PS1) 10Majavah: P:wmcs: ntp: Automatically restart the service after config changes [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099)
[10:20:42] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: 10Majavah)
[10:21:14] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be2076.codfw.wmnet with reason: container db repair
[10:21:23] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10957886 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=86ad3659-4dfe-4a19-8925-7580975c3341) set by mvernon@cumin2002 fo...
[10:22:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:22:09] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:22:28] <Emperor>	 !log repair wikipedia-commons-local-thumb.bb on ms-be2076 T383053
[10:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:34] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[10:22:38] <wikibugs>	 (03PS1) 10Ayounsi: Remove some Arelion/NTT traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/1164972 (https://phabricator.wikimedia.org/T377844)
[10:23:09] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:23:59] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:24:01] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 5.572 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:24:07] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.038 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:24:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris)
[10:24:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris)
[10:24:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be2076.codfw.wmnet
[10:24:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2076.codfw.wmnet
[10:25:38] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw
[10:25:56] <Emperor>	 !log repool codfw ms-swift after container DB repairs T383053
[10:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:07] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6e
[10:26:52] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris)
[10:27:26] <wikibugs>	 (03CR) 10Joal: [C:03+1] "LGTM! Should we also bump the available CPU resource for the pod?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis)
[10:27:37] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris)
[10:28:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[10:28:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:53] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:29:02] <wikibugs>	 (03PS25) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey)
[10:29:49] <wikibugs>	 (03CR) 10Cyndywikime: [C:03+1] Growth(enwiki): enable limiting Add a Link to new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große)
[10:30:06] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957935 (10MoritzMuehlenhoff)
[10:30:18] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2021.codfw.wmnet
[10:33:19] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[10:33:36] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1003.eqiad.wmnet: Renew puppet certificate - jynus@cumin1003
[10:33:38] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: add ImageSiftBot [puppet] - 10https://gerrit.wikimedia.org/r/1164975
[10:33:51] <logmsgbot>	 jmm@cumin1003 decommission (PID 3848085) is awaiting input
[10:34:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:35:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] external_clouds_vendors: add ImageSiftBot [puppet] - 10https://gerrit.wikimedia.org/r/1164975 (owner: 10Giuseppe Lavagetto)
[10:36:05] <icinga-wm>	 PROBLEM - Host backup1003 is DOWN: PING CRITICAL - Packet loss = 100%
[10:36:30] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6e
[10:37:15] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[10:37:27] <icinga-wm>	 RECOVERY - Host backup1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[10:37:58] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[10:38:09] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:40:30] <vgutierrez>	 !log depool cp7001
[10:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:18] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957961 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium Hi,  As far as I can tell, you have access to the `analytics-platform-eng-adm...
[10:42:18] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: haproxy tests
[10:43:34] <logmsgbot>	 jmm@cumin1003 decommission (PID 3848085) is awaiting input
[10:43:45] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[10:43:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957980 (10Clement_Goubert)
[10:45:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957983 (10Clement_Goubert) An old version of the L3 document was signed, could you sign the updated version as well, please?
[10:45:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957984 (10Clement_Goubert) a:03Clement_Goubert
[10:45:54] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[10:46:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10957990 (10Clement_Goubert) 05Open→03Stalled Stalled waiting for SSH k...
[10:46:35] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10957992 (10Clement_Goubert) p:05Triage→03Medium
[10:46:40] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[10:46:40] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:46:41] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2021.codfw.wmnet
[10:47:27] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:47:37] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2022.codfw.wmnet
[10:47:42] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[10:48:12] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10958013 (10Clement_Goubert) 05Open→03Stalled a:03Clement_Goubert Stalled waiting for confirmation of access from @DerHexer
[10:49:09] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10958019 (10Clement_Goubert) p:05Triage→03Medium
[10:49:47] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10958020 (10Clement_Goubert)
[10:50:00] <wikibugs>	 (03CR) 10Hnowlan: [C:04-1] "This will need an image bump. Will `0x` be in the $PATH by default?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos)
[10:50:43] <logmsgbot>	 jmm@cumin1003 decommission (PID 3848941) is awaiting input
[10:50:45] <wikibugs>	 (03PS1) 10Cyndywikime: Growth: Configure higher impact module edit limits for english and test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599)
[10:52:27] <logmsgbot>	 !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: remove for decom
[10:53:28] <wikibugs>	 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10958046 (10Joe) p:05Triage→03High a:03Joe
[10:53:34] <wikibugs>	 (03CR) 10Ayounsi: "Awesome, thanks a lot! a few small comments inline." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161448 (owner: 10Effie Mouzeli)
[10:54:55] <logmsgbot>	 jmm@cumin1003 decommission (PID 3848941) is awaiting input
[10:55:28] <wikibugs>	 (03CR) 10Urbanecm: "question: should we enable on testwiki first, or is this safe enough to deploy on enwiki together with testwiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große)
[10:56:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:58:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590)
[10:58:33] <wikibugs>	 (03CR) 10Michael Große: "TBH, I half made-up the decision to also deploy it to testwiki in the first place, because I don't think that it makes sense to deploy it " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große)
[10:58:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff)
[10:59:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7001.magru.wmnet
[10:59:34] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7001.magru.wmnet
[10:59:47] <logmsgbot>	 !log taavi@cumin1003 START - Cookbook sre.dns.netbox
[11:00:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) (owner: 10Hnowlan)
[11:01:43] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185)
[11:02:51] <wikibugs>	 (03PS26) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey)
[11:03:52] <wikibugs>	 (03PS69) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985)
[11:05:09] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[11:05:17] <vgutierrez>	 !log repool cp7001
[11:05:19] <logmsgbot>	 taavi@cumin1003 netbox (PID 3851379) is awaiting input
[11:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:37] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:07:37] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2022.codfw.wmnet
[11:09:12] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission ganeti2021 / ganeti2022 - https://phabricator.wikimedia.org/T398182#10958090 (10MoritzMuehlenhoff)
[11:09:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] PyBalBGPUnstable: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164959 (https://phabricator.wikimedia.org/T396321) (owner: 10Tiziano Fogli)
[11:09:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] LibericaEtcdErrors: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164957 (https://phabricator.wikimedia.org/T396320) (owner: 10Tiziano Fogli)
[11:09:27] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185)
[11:09:56] <logmsgbot>	 !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloudvps ns-recursor v6 addresses - taavi@cumin1003"
[11:10:13] <logmsgbot>	 !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloudvps ns-recursor v6 addresses - taavi@cumin1003"
[11:10:13] <logmsgbot>	 !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:10:20] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[11:10:34] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy,varnish: Introduce a host independent healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[11:10:47] <wikibugs>	 (03PS1) 10Majavah: Add include for WMCS codfw private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1164983 (https://phabricator.wikimedia.org/T379282)
[11:11:52] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Add include for WMCS codfw private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1164983 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah)
[11:11:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:11:58] <logmsgbot>	 !log taavi@dns1004 START - running authdns-update
[11:12:18] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185)
[11:12:22] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[11:12:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590)
[11:12:57] <urbanecm>	 !log Start GrowthExperiments:fixLinkRecommendationData --wiki=enwiki --db-table --force (T386867)
[11:13:00] <logmsgbot>	 !log taavi@dns1004 END - running authdns-update
[11:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:03] <stashbot>	 T386867: Add a Link: add "do not link" rule for country names (Q6256) on English Wikipedia - https://phabricator.wikimedia.org/T386867
[11:15:44] <wikibugs>	 (03CR) 10Jcrespo: "@Alex see the promised cleanup here." [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[11:16:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff)
[11:16:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:18:06] <wikibugs>	 (03PS1) 10Gmodena: Revert "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984
[11:18:07] <hnowlan>	 jouncebot: nowandnext
[11:18:07] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 41 minute(s)
[11:18:07] <jouncebot>	 In 1 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1300)
[11:18:17] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) (owner: 10Hnowlan)
[11:18:21] <gmodena>	 hey folks. We need to do an emergency deployment for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164984
[11:19:10] <gmodena>	 cc ^ taavi 
[11:20:27] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[11:20:44] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10958131 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs
[11:21:38] <gmodena>	 hey folks. We need to do an emergency deployment for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164984 cc taavi
[11:21:58] <Emperor>	 !log depool eqiad ms-swift for container DB repairs T383053
[11:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:04] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[11:22:05] <claime>	 heads up vgutierrez slyngs (on-call) ^
[11:22:09] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad
[11:22:22] <claime>	 And I think you can go ahead gmodena 
[11:22:38] <wikibugs>	 (03PS2) 10Btullis: Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164)
[11:22:38] <taavi>	 not sure why you're only pinging me, 301 effie
[11:23:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 (owner: 10Gmodena)
[11:23:41] <wikibugs>	 (03CR) 10Jcrespo: "@volans could I get a review from you to alter: cumin/aliases.yaml.erb as suggested?" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[11:24:08] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 (owner: 10Gmodena)
[11:24:22] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1164984|Revert "Clean up EventBus and jobs config"]]
[11:26:34] <logmsgbot>	 !log phuedx@deploy1003 gmodena, phuedx: Backport for [[gerrit:1164984|Revert "Clean up EventBus and jobs config"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:26:43] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:29:33] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[11:29:47] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[11:30:09] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10958161 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs
[11:35:53] <wikibugs>	 (03CR) 10Btullis: "Done." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis)
[11:35:54] <gmodena>	 verified on the testserver. Syncing.
[11:36:01] <logmsgbot>	 !log phuedx@deploy1003 gmodena, phuedx: Continuing with sync
[11:38:50] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1063,1074,1083].eqiad.wmnet with reason: container db repair
[11:38:56] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958196 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9fb1e08-c62e-4b34-b173-ffc58ee22ef8) set by mvernon@cumin2002 fo...
[11:39:18] <Emperor>	 !log repair wikipedia-commons-local-thumb.6b on ms-be10[63,74,83] T383053
[11:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:24] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[11:40:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) (owner: 10KartikMistry)
[11:41:41] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164984|Revert "Clean up EventBus and jobs config"]] (duration: 17m 19s)
[11:43:03] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis)
[11:43:17] <phuedx>	 There have been no changes in the logs on mwlog1002
[11:43:24] <phuedx>	 (Good)
[11:43:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1063,1074,1083].eqiad.wmnet
[11:43:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1063,1074,1083].eqiad.wmnet
[11:44:55] <wikibugs>	 (03Merged) 10jenkins-bot: Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis)
[11:45:27] <wikibugs>	 (03CR) 10Majavah: [C:03+2] dynamicproxy: Support IPv6-enabled recursors [puppet] - 10https://gerrit.wikimedia.org/r/1160135 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah)
[11:45:34] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge: nginx: Support IPv6-enabled recursors [puppet] - 10https://gerrit.wikimedia.org/r/1160136 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah)
[11:45:42] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:45:49] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1067,1070,1089].eqiad.wmnet with reason: container db repair
[11:45:55] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=05b6b261-690d-46bf-a42d-e69d15adcfc8) set by mvernon@cumin2002 fo...
[11:45:56] <Emperor>	 !log repair wikipedia-commons-local-thumb.79 on ms-be10[70,67,89] T383053
[11:45:59] <hnowlan>	 Are ye all done with your deploys? I'd like to restart pybal if possible 
[11:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:03] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[11:47:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[11:47:55] <phuedx>	 hnowlan: I'm done with my deploy
[11:48:19] <hnowlan>	 cool, ty
[11:50:00] <hnowlan>	 !log restarting pybal on lvs-secondary-eqiad 
[11:50:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add Joanna to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/1161896 (owner: 10Muehlenhoff)
[11:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:12] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1067,1070,1089].eqiad.wmnet
[11:50:14] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1067,1070,1089].eqiad.wmnet
[11:50:19] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[11:51:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10958244 (10MoritzMuehlenhoff)
[11:52:02] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1066,1087,1090].eqiad.wmnet with reason: container db repair
[11:52:08] <Emperor>	 !log repair wikipedia-commons-local-thumb.b7 on ms-be10[66,87,90] T383053
[11:52:09] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958245 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0077a8da-dd9e-44d3-b59a-d42061bdb69b) set by mvernon@cumin2002 fo...
[11:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:13] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[11:52:20] <moritzm>	 !log installing mongo-c-driver security updates
[11:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:11] <wikibugs>	 (03CR) 10Jgiannelos: "Good call, it's under node_modules." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos)
[11:53:56] <wikibugs>	 (03CR) 10Volans: "Sure the change LGTM. I assumed that the olddirector is excluded as it's being decommissioned." [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[11:55:45] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164986
[11:56:13] <logmsgbot>	 !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5004.eqsin.wmnet with reason: reimage
[11:56:18] <hnowlan>	 !log restarting pybal on A:lvs-low-traffic-eqiad 
[11:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:24] <wikibugs>	 (03PS3) 10Jgiannelos: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961
[11:57:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1066,1087,1090].eqiad.wmnet
[11:57:33] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1066,1087,1090].eqiad.wmnet
[11:58:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5004.eqsin.wmnet with OS bookworm
[11:58:45] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm
[11:59:15] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1078-1079,1085].eqiad.wmnet with reason: container db repair
[11:59:20] <Emperor>	 !log repair wikipedia-commons-local-thumb.d3 on ms-be10[78,79,85] T383053
[11:59:23] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958267 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=469f036d-84d0-4d5b-8246-e8056b4949ca) set by mvernon@cumin2002 fo...
[11:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:26] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[11:59:47] <wikibugs>	 (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164986 (owner: 10PipelineBot)
[12:00:06] <wikibugs>	 (03CR) 10Volans: "It's being already removed in this change, my bad :)" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[12:00:18] <wikibugs>	 (03CR) 10Jgiannelos: "I updated the patch with the correct path and the new image." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos)
[12:00:48] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos)
[12:00:56] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[12:01:09] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[12:01:12] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos)
[12:03:11] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos)
[12:04:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1078-1079,1085].eqiad.wmnet
[12:04:18] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1078-1079,1085].eqiad.wmnet
[12:05:27] <Emperor>	 !log repair wikipedia-commons-local-thumb.ea on ms-be10[78,80] T383053
[12:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:33] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[12:05:36] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1078,1080].eqiad.wmnet with reason: container db repair
[12:05:41] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958286 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a1ed9509-0228-4592-b2ad-7dd36a1c170f) set by mvernon@cumin2002 fo...
[12:06:23] <wikibugs>	 (03PS2) 10JMeybohm: k8s.pool-depool-cluster: Black format [cookbooks] - 10https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148)
[12:06:24] <wikibugs>	 (03PS12) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148)
[12:07:52] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:08:10] <wikibugs>	 (03CR) 10Majavah: [C:03+2] natlog: Persist logs to /srv [puppet] - 10https://gerrit.wikimedia.org/r/1160104 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[12:08:37] <wikibugs>	 (03CR) 10Jcrespo: "Thanks, Volans, will wait for Alex or someone else to review the rest of the changes." [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[12:08:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958288 (10VRiley-WMF)
[12:09:55] <wikibugs>	 (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm)
[12:10:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1078,1080].eqiad.wmnet
[12:10:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1078,1080].eqiad.wmnet
[12:10:46] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185)
[12:11:12] <Emperor>	 !log repool eqiad ms-swift after container DB repairs T383053
[12:11:13] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad
[12:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:21] <stashbot>	 T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053
[12:11:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958293 (10VRiley-WMF) @BTullis We have been trying to image an-worker1186 for a while now. Working with @Jhancock.wm on this for a while and it...
[12:11:40] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[12:11:40] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo)
[12:12:13] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ea
[12:13:31] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:14:26] <wikibugs>	 (03PS1) 10Brouberol: global_config: provision thanos-swift-{eqiad,codfw} external services [puppet] - 10https://gerrit.wikimedia.org/r/1164991 (https://phabricator.wikimedia.org/T398186)
[12:15:16] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: enable task pods to reach out to thanos-swift in both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164993 (https://phabricator.wikimedia.org/T398186)
[12:15:17] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: define a connection to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164994 (https://phabricator.wikimedia.org/T398186)
[12:15:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10958319 (10Jclark-ctr)
[12:17:48] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:17:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: 10Gergő Tisza)
[12:18:44] <wikibugs>	 (03PS1) 10Volans: images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696)
[12:21:26] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ea
[12:22:07] <wikibugs>	 (03PS2) 10Volans: debmonitor: use the new endpoint for the check [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696)
[12:22:07] <wikibugs>	 (03PS1) 10Volans: debmonitor: add link to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696)
[12:23:03] <wikibugs>	 (03CR) 10Btullis: [C:03+1] global_config: provision thanos-swift-{eqiad,codfw} external services [puppet] - 10https://gerrit.wikimedia.org/r/1164991 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol)
[12:23:24] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-ml: enable task pods to reach out to thanos-swift in both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164993 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol)
[12:24:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage
[12:24:11] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-ml: define a connection to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164994 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol)
[12:24:40] <wikibugs>	 (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: provision thanos-swift-{eqiad,codfw} external services [puppet] - 10https://gerrit.wikimedia.org/r/1164991 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol)
[12:24:51] <wikibugs>	 (03CR) 10Urbanecm: "Doing both in the same window wouldn't provide additional opportunities for testing. If we are confident about the feature, deploying as-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große)
[12:24:57] <wikibugs>	 (03CR) 10Elukey: [C:03+1] images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:25:09] <wikibugs>	 (03CR) 10Elukey: [C:03+1] debmonitor: add link to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:25:18] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: enable task pods to reach out to thanos-swift in both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164993 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol)
[12:26:50] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[12:26:55] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:27:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage
[12:28:30] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[12:28:37] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[12:29:53] <wikibugs>	 (03CR) 10Volans: [C:03+2] images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:30:43] <wikibugs>	 (03Merged) 10jenkins-bot: images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[12:31:43] <wikibugs>	 (03CR) 10Clément Goubert: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm)
[12:32:24] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:32:41] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:34:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Use separate resource names for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165003
[12:34:30] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[12:35:04] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[12:35:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165003 (owner: 10Muehlenhoff)
[12:36:27] <moritzm>	 !log installing qtbase-opensource-src security updates
[12:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[12:38:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-main/production on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:38:53] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[12:39:13] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[12:40:27] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165003 (owner: 10Muehlenhoff)
[12:40:32] <wikibugs>	 (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm)
[12:40:41] <wikibugs>	 (03PS3) 10JMeybohm: k8s.pool-depool-cluster: Black format [cookbooks] - 10https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148)
[12:40:41] <wikibugs>	 (03PS13) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148)
[12:43:45] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:43:49] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release airflow-main/production on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:45:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:46:33] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:47:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:49:10] <logmsgbot>	 !log jgreen@cumin1002 START - Cookbook sre.dns.netbox
[12:52:57] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: define a connection to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164994 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol)
[12:54:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5004.eqsin.wmnet with OS bookworm
[12:54:17] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958470 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm completed: - ganeti5004 (**PASS*...
[12:54:43] <logmsgbot>	 jgreen@cumin1002 netbox (PID 1860130) is awaiting input
[12:55:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Use separate resource names for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165003 (owner: 10Muehlenhoff)
[12:55:41] <logmsgbot>	 !log jgreen@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frnetmon1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1002"
[12:55:45] <logmsgbot>	 !log jgreen@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frnetmon1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1002"
[12:55:45] <logmsgbot>	 !log jgreen@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:56:18] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frnetmon1001 - https://phabricator.wikimedia.org/T398079#10958475 (10Jgreen) a:05Jgreen→03None
[12:56:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 112.71, 101.97, 93.48 https://wikitech.wikimedia.org/wiki/Swift
[12:58:00] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.6.4 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165006
[12:58:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[12:58:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 107.95, 101.43, 96.22 https://wikitech.wikimedia.org/wiki/Swift
[12:58:52] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004
[13:00:05] <jouncebot>	 Urbanecm and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1300).
[13:00:05] <jouncebot>	 phuedx, LD, and sd0001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti5004
[13:00:31] <LD>	 party time \O/
[13:00:37] <sd0001>	 o/
[13:01:06] <phuedx>	 o/
[13:01:42] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:02:00] <phuedx>	 I can deploy mine. LD, sd0001: Can you deploy yours?
[13:02:16] <LD>	 yep
[13:02:23] <LD>	 i mean no
[13:02:25] <sd0001>	 I don't have access, can you deploy mine too?
[13:02:51] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009
[13:02:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 (owner: 10Jgiannelos)
[13:03:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:03:54] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:03:56] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009
[13:04:07] <LD>	 By the way, neither of our changes has proper tests, so it's an all-in
[13:04:28] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6b
[13:04:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 105.24, 100.36, 95.66 https://wikitech.wikimedia.org/wiki/Swift
[13:05:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164388 (https://phabricator.wikimedia.org/T397611) (owner: 10Phuedx)
[13:05:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:05:57] <phuedx>	 LD: Do you have a way to test your change when it's on the test servers?
[13:06:25] <wikibugs>	 (03Merged) 10jenkins-bot: ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164388 (https://phabricator.wikimedia.org/T397611) (owner: 10Phuedx)
[13:06:38] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1164388|ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams (T397611)]]
[13:06:43] <stashbot>	 T397611: Decommission the TwoColConflictConflict and -Exit instruments - https://phabricator.wikimedia.org/T397611
[13:07:04] <LD>	 phuedx i don't think so
[13:07:19] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[13:07:39] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:07:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 111.00, 102.57, 98.82 https://wikitech.wikimedia.org/wiki/Swift
[13:08:04] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958533 (10MatthewVernon) These corrupt DBs have all been repaired now.
[13:08:31] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1164388|ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams (T397611)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:08:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165006 (owner: 10Volans)
[13:09:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:09:36] <phuedx>	 Tested by hitting the streamconfigs API. The TwoColConflict* stream configs are not present in the response on the test server
[13:09:43] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with sync
[13:10:01] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[13:10:11] <wikibugs>	 (03PS27) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826
[13:12:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:12:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for qtbase-opensource-src [puppet] - 10https://gerrit.wikimedia.org/r/1165012
[13:12:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:13:18] <wikibugs>	 (03PS1) 10Eevans: sessionstore1004: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165013 (https://phabricator.wikimedia.org/T391544)
[13:13:19] <wikibugs>	 (03PS1) 10Eevans: sessionstore1004: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165014 (https://phabricator.wikimedia.org/T391544)
[13:13:21] <wikibugs>	 (03PS1) 10Eevans: sessionstore1005: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165015 (https://phabricator.wikimedia.org/T391544)
[13:13:22] <wikibugs>	 (03PS1) 10Eevans: sessionstore1005: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165016 (https://phabricator.wikimedia.org/T391544)
[13:13:24] <wikibugs>	 (03PS1) 10Eevans: sessionstore1006: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544)
[13:13:26] <wikibugs>	 (03PS1) 10Eevans: sessionstore1006: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165018 (https://phabricator.wikimedia.org/T391544)
[13:13:28] <wikibugs>	 (03PS1) 10Eevans: sessionstore: preseed eqiad servers for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1165019 (https://phabricator.wikimedia.org/T391544)
[13:14:04] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6b
[13:14:07] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.79
[13:14:31] <phuedx>	 sd0001: I see the note about query performance against your patch. Has the issue been taken care of?
[13:15:11] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164388|ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams (T397611)]] (duration: 08m 32s)
[13:15:16] <stashbot>	 T397611: Decommission the TwoColConflictConflict and -Exit instruments - https://phabricator.wikimedia.org/T397611
[13:15:30] <sd0001>	 phuedx: yes, the query was tested on mwdebug by musikanimal and it took only 1 minute
[13:16:05] <sd0001>	 (it's run in a cron job, not in webrequests, so 1 min is fine)
[13:16:08] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi)
[13:16:45] <logmsgbot>	 jclark@cumin1002 provision (PID 1884572) is awaiting input
[13:16:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 104.40, 101.15, 98.76 https://wikitech.wikimedia.org/wiki/Swift
[13:17:10] <wikibugs>	 (03CR) 10Jgiannelos: [C:04-1] "Even with this 0x is spamming with debug logs. Looking at it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 (owner: 10Jgiannelos)
[13:17:35] <wikibugs>	 (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime)
[13:17:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 100.88, 100.01, 99.29 https://wikitech.wikimedia.org/wiki/Swift
[13:17:58] <wikibugs>	 (03PS1) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091)
[13:18:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: 10Tiziano Fogli)
[13:18:32] <vgutierrez>	 is the thanos-be cluster struggling expected?
[13:18:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hint for qtbase-opensource-src [puppet] - 10https://gerrit.wikimedia.org/r/1165012 (owner: 10Muehlenhoff)
[13:19:35] <wikibugs>	 (03PS2) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091)
[13:19:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD)
[13:20:26] <wikibugs>	 (03Merged) 10jenkins-bot: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD)
[13:20:40] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1161478|frwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T397063)]]
[13:20:46] <stashbot>	 T397063: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group - https://phabricator.wikimedia.org/T397063
[13:20:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 105.14, 100.81, 99.62 https://wikitech.wikimedia.org/wiki/Swift
[13:21:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165013 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[13:21:28] <wikibugs>	 (03PS1) 10Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165021 (https://phabricator.wikimedia.org/T398164)
[13:21:49] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 105.78, 100.90, 98.99 https://wikitech.wikimedia.org/wiki/Swift
[13:21:55] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:22:34] <logmsgbot>	 !log phuedx@deploy1003 phuedx, wpld: Backport for [[gerrit:1161478|frwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T397063)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:23:20] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.79
[13:23:21] <phuedx>	 LD: I looked at your change. It's been OK'd by Tchanders and Dreamy_Jazz. Unless you know a bureaucrat on frwiki, it'll be quite hard to test on the test servers :)
[13:23:22] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.99
[13:23:41] <wikibugs>	 (03PS3) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091)
[13:24:10] <LD>	 phuedx I'm not ^^
[13:25:02] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165021 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol)
[13:25:04] <LD>	 but it should work fine, coreperm stuff anyway
[13:25:33] <logmsgbot>	 !log phuedx@deploy1003 phuedx, wpld: Continuing with sync
[13:26:34] <logmsgbot>	 jclark@cumin1002 reimage (PID 1894712) is awaiting input
[13:27:19] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004
[13:27:47] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti5004
[13:28:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10958586 (10MoritzMuehlenhoff)
[13:29:01] <phuedx>	 sd0001: I've had a quick look at the patch that's referenced in the commit. The query is run on every request to /wiki/Special:GadgetUsage and not via a cron job
[13:29:25] <sd0001>	 no, it's cached as part of the QueryPage system if MiserMode is off
[13:29:36] <sd0001>	 * I mean if MiserMode is on
[13:29:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[13:29:56] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004
[13:29:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:30:19] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti5004
[13:31:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:31:16] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161478|frwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T397063)]] (duration: 10m 36s)
[13:31:22] <stashbot>	 T397063: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group - https://phabricator.wikimedia.org/T397063
[13:32:04] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good note that even for Cloud, we made a few changes to the ntpsec.conf file, such as removing the iburst option. See Id62f3bf2a4d11" [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: 10Majavah)
[13:32:17] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165021 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol)
[13:33:00] <sd0001>	 phuedx: any concerns? see also this comment by Ladsgroup about how the query is run on Wikimedia infra: https://phabricator.wikimedia.org/T121516#10916810 (unrelated ticket, but same topic)
[13:34:28] <wikibugs>	 (03PS1) 10Slyngshede: Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300)
[13:35:27] <phuedx>	 sd0001: Thanks for that. Reading :)
[13:35:50] <wikibugs>	 (03PS1) 10JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148)
[13:35:55] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.99
[13:35:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b7
[13:36:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:36:25] <wikibugs>	 (03CR) 10Slyngshede: "The issue can be reproduced by the following code:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede)
[13:37:54] <LD>	 Alright, thanks for the party, phuedx
[13:38:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[13:38:26] <wikibugs>	 (03CR) 10Ssingh: hiera: enable exporting prom metrics from doh1001 for anycast-hc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:40:02] <Reedy>	 jouncebot: nowandnext
[13:40:02] <jouncebot>	 For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1300)
[13:40:03] <jouncebot>	 In 0 hour(s) and 49 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[13:40:34] <phuedx>	 sd0001: Just to confirm: The query has been optimized. The optimized query has been deployed and already been run (it was merged ~10 days ago)
[13:40:37] <wikibugs>	 (03PS2) 10Slyngshede: Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300)
[13:40:57] <phuedx>	 The config change is to use the cached results of that optimized query on enwiki?
[13:41:00] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[13:41:51] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[13:41:58] <sd0001>	 phuedx: config change is to trigger the optimized query to run
[13:42:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[13:42:46] <sd0001>	 (it will run the next time maintenance/updateSpecialPages.php is run)
[13:43:10] <sd0001>	 until that time, Special:GadgetUsage will show some dummy values for active users as data won't be available
[13:43:28] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough
[13:43:38] <phuedx>	 I see
[13:44:05] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[13:44:10] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Fix staging config path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165028
[13:44:12] <wikibugs>	 (03Abandoned) 10Jgiannelos: mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 (owner: 10Jgiannelos)
[13:44:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:44:49] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b7
[13:44:52] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.bb
[13:44:58] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958648 (MoritzMuehlenhoff)
[13:45:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164490 (https://phabricator.wikimedia.org/T397454) (owner: SD0001)
[13:46:33] <phuedx>	 sd0001: It goes without saying: Please don't land the removal of the config flag in the Gadgets extension until you and Data Persistence are satisfied that the query is performant enough :)
[13:46:41] <wikibugs>	 (Merged) jenkins-bot: Re-enable wgSpecialGadgetUsageActiveUsers for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1164490 (https://phabricator.wikimedia.org/T397454) (owner: SD0001)
[13:46:54] <sd0001>	 phuedx: sure
[13:46:57] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1164490|Re-enable wgSpecialGadgetUsageActiveUsers for enwiki (T397454)]]
[13:47:04] <stashbot>	 T397454: Show active user stats on Special:GadgetUsage in English Wikipedia - https://phabricator.wikimedia.org/T397454
[13:47:20] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003"
[13:47:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003"
[13:47:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:47:48] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[13:48:35] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox)
[13:48:56] <logmsgbot>	 !log phuedx@deploy1003 phuedx, sd: Backport for [[gerrit:1164490|Re-enable wgSpecialGadgetUsageActiveUsers for enwiki (T397454)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:49:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:49:19] <phuedx>	 sd0001: Please check Special:GadgetUsage on the test servers :)
[13:49:42] <sd0001>	 phuedx: looks good, can see the column for active users showing up
[13:49:59] <logmsgbot>	 !log phuedx@deploy1003 phuedx, sd: Continuing with sync
[13:50:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:51:05] <wikibugs>	 (CR) Ayounsi: [C:+1] Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) (owner: Slyngshede)
[13:51:19] <wikibugs>	 (CR) Ssingh: [V:+1] "Thanks for the reminder; I will once I merge this." [puppet] - https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[13:55:26] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164490|Re-enable wgSpecialGadgetUsageActiveUsers for enwiki (T397454)]] (duration: 08m 28s)
[13:55:32] <stashbot>	 T397454: Show active user stats on Special:GadgetUsage in English Wikipedia - https://phabricator.wikimedia.org/T397454
[13:55:39] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough
[13:55:42] <phuedx>	 !log UTC afternoon backport window finished
[13:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:36] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.bb
[13:56:39] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d3
[13:56:53] <sd0001>	 phuedx: thanks!
[13:57:28] <phuedx>	 sd0001: yw. Thanks for pointing me at the QueryPage subsystem. TIL!
[13:57:51] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:58:01] <wikibugs>	 (PS14) JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - https://gerrit.wikimedia.org/r/1164316
[13:58:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye
[13:58:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958700 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1186.eqiad.wmnet with OS b...
[13:58:56] <XioNoX>	 !log push pfw policies - T397875
[13:59:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:21] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[14:00:31] <Reedy>	 jouncebot: nowandnext
[14:00:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[14:00:32] <jouncebot>	 In 0 hour(s) and 29 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[14:00:58] <sd0001>	 phuedx: if you're comfortable, can you also run the script so that the actual counts of active users show up?
[14:01:06] <sd0001>	 see https://wikitech.wikimedia.org/wiki/Regenerate_cached_special_pages - although the instructions are old; the command would need to be adjusted to use mwscript-k8s
[14:01:26] <wikibugs>	 (PS2) JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148)
[14:02:12] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox
[14:03:05] <logmsgbot>	 jmm@cumin2002 addnode (PID 3422069) is awaiting input
[14:04:30] <wikibugs>	 (PS3) JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148)
[14:04:52] <vgutierrez>	 !log rolling restart of pybal on lvs201[34]
[14:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:58] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d3
[14:05:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:05:48] <icinga-wm>	 PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 100.76, 100.05, 98.18 https://wikitech.wikimedia.org/wiki/Swift
[14:06:16] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): cloudcephosd200[567]-dev service implementation - https://phabricator.wikimedia.org/T397237#10958743 (Andrew) p:High→Medium
[14:06:30] <phuedx>	 sd0001: I have to run an errand before my next meeting. Could you ask in #wikimedia-data-persistence?
[14:06:34] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:07:12] <sd0001>	 phuedx: okay!
[14:07:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[14:08:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet
[14:08:23] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti5005.eqsin.wmnet
[14:08:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet
[14:08:49] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958758 (ops-monitoring-bot) Draining ganeti5005.eqsin.wmnet of running VMs
[14:08:57] <urandom>	 !decommissioning Cassandra/sessionstore1004-a — T391544
[14:08:57] <stashbot>	 T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544
[14:09:28] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore1004: reimage for JBOD configuration [puppet] - https://gerrit.wikimedia.org/r/1165013 (https://phabricator.wikimedia.org/T391544) (owner: Eevans)
[14:09:32] <vgutierrez>	 urandom: !log?
[14:09:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet
[14:09:58] <wikibugs>	 (CR) Michael Große: Growth: Configure higher impact module edit limits for english and test wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: Cyndywikime)
[14:11:00] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[14:11:14] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: Michael Große)
[14:12:20] <wikibugs>	 (CR) Clément Goubert: [C:+1] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[14:13:07] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-future Generating DNS records from Netbox and syncing changes - demo run of new cookbook - cmooney@cumin1003
[14:13:14] <wikibugs>	 (CR) Vgutierrez: [C:+2] service: Target upload.wm.o on upload-https healthchecks [puppet] - https://gerrit.wikimedia.org/r/1164466 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[14:13:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:13:46] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:14:03] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox-future (exit_code=0) Generating DNS records from Netbox and syncing changes - demo run of new cookbook - cmooney@cumin1003
[14:14:25] <wikibugs>	 (PS1) Jelto: gitlab: remove git_data_dirs setting [puppet] - https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382)
[14:15:21] <wikibugs>	 (PS1) Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164)
[14:16:14] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6108/co" [puppet] - https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) (owner: Jelto)
[14:17:14] <wikibugs>	 (CR) Clément Goubert: [C:+1] sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[14:18:31] <wikibugs>	 (PS4) Andrew Bogott: Openstack designate: use 'designate' service user instead of novaadmin [puppet] - https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150)
[14:18:33] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) (owner: Andrew Bogott)
[14:18:46] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4010.ulsfo.wmnet} and A:liberica (T394484)
[14:18:53] <stashbot>	 T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484
[14:19:04] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4010.ulsfo.wmnet} and A:liberica (T394484)
[14:19:30] <jinxer-wm>	 FIRING: LibericaStaleConfig: Liberica instance lvs4010 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[14:19:55] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2003.codfw.wmnet with reason: Maintenance and reboot
[14:20:28] <wikibugs>	 (CR) Btullis: [C:+1] airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: Brouberol)
[14:21:10] <wikibugs>	 (PS7) Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091)
[14:23:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1186.eqiad.wmnet with reason: host reimage
[14:23:47] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1004.eqiad.wmnet with OS bullseye
[14:23:51] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[14:23:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet
[14:23:59] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10958835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1004.e...
[14:24:02] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958836 (ops-monitoring-bot) Draining ganeti5005.eqsin.wmnet of running VMs
[14:24:30] <jinxer-wm>	 RESOLVED: LibericaStaleConfig: Liberica instance lvs4010 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[14:25:34] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: Brouberol)
[14:26:42] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Fix acl checks for unique path healthcheck [puppet] - https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484)
[14:26:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1186.eqiad.wmnet with reason: host reimage
[14:27:05] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[14:27:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:27:18] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:27:23] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003"
[14:27:27] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003"
[14:27:28] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:28:16] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:28:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:46] <wikibugs>	 (CR) Ssingh: [C:+1] "Yes I made it in the review as well. Adds up." [puppet] - https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[14:29:55] <wikibugs>	 (CR) Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: Brouberol)
[14:30:06] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[14:30:08] <wikibugs>	 (CR) Vgutierrez: [C:+2] cache::haproxy: Fix acl checks for unique path healthcheck [puppet] - https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[14:32:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:36:09] <wikibugs>	 (PS2) Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164)
[14:36:49] <wikibugs>	 ops-codfw, SRE, DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T398024#10958898 (Jhancock.wm) Open→Resolved a:Jhancock.wm errors ceased on the 27th. likely issues with xcon owned...
[14:36:51] <wikibugs>	 (PS3) Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164)
[14:36:51] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore1004: assign JBOD data_file_directories [puppet] - https://gerrit.wikimedia.org/r/1165014 (https://phabricator.wikimedia.org/T391544) (owner: Eevans)
[14:39:36] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[14:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:40:38] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1004.eqiad.wmnet with reason: host reimage
[14:42:21] <wikibugs>	 (CR) Herron: [C:+1] pushgateway: rotate logs hourly [puppet] - https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: Tiziano Fogli)
[14:43:38] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-main: increase max pod/container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: Brouberol)
[14:44:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[14:44:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:44:24] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1004.eqiad.wmnet with reason: host reimage
[14:44:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:44:29] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1186.eqiad.wmnet with OS bullseye
[14:44:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958918 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls...
[14:45:37] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:45:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:46:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[14:46:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[14:46:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958938 (Jclark-ctr) It looks like the system had a bad DAC cable. While running the provisioning script, it prompted me to select the PXE por...
[14:47:00] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm
[14:47:05] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and (A:dnsbox)
[14:47:05] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10958939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm
[14:47:42] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:21] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:49:01] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:49:14] <sukhe>	 !log running dummy authdns-update after service restarts on A:dnsbox
[14:49:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:48] <icinga-wm>	 RECOVERY - very high load average likely xfs on thanos-be1007 is OK: OK - load average: 57.93, 68.68, 79.40 https://wikitech.wikimedia.org/wiki/Swift
[14:49:58] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:50:53] <wikibugs>	 (CR) Ssingh: "Very minor comments in-line, nothing related to functionality:" [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: Cathal Mooney)
[14:50:59] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:51:48] <icinga-wm>	 RECOVERY - very high load average likely xfs on thanos-be1009 is OK: OK - load average: 54.45, 68.38, 79.76 https://wikitech.wikimedia.org/wiki/Swift
[14:52:10] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] pushgateway: rotate logs hourly [puppet] - https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: Tiziano Fogli)
[14:54:14] <logmsgbot>	 jhancock@cumin1003 provision (PID 3874714) is awaiting input
[14:54:47] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] pushgateway: rotate logs hourly [puppet] - https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: Tiziano Fogli)
[14:55:06] <wikibugs>	 (PS2) Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970)
[14:55:57] <wikibugs>	 (PS3) Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970)
[14:56:18] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:57:59] <wikibugs>	 (CR) Scott French: [C:+1] changeprop: add missing total_delay histogram [deployment-charts] - https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) (owner: Hnowlan)
[14:58:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:58:31] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10958984 (MoritzMuehlenhoff)
[14:58:54] <wikibugs>	 (PS4) Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970)
[15:03:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:03:26] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1004.eqiad.wmnet with OS bullseye
[15:03:37] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959001 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1004.eqiad...
[15:04:37] <wikibugs>	 (CR) Hnowlan: [C:+2] changeprop: add missing total_delay histogram [deployment-charts] - https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) (owner: Hnowlan)
[15:05:42] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore1004.eqiad.wmnet
[15:06:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:06:38] <wikibugs>	 (Merged) jenkins-bot: changeprop: add missing total_delay histogram [deployment-charts] - https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) (owner: Hnowlan)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:08] <hnowlan>	 phuedx: there's been a fairly significant increase in worker saturation in mw-web since your patch was rolled out, do you know whether it might be the cause? https://grafana.wikimedia.org/goto/GW0OhZsHg?orgId=1 
[15:10:07] <logmsgbot>	 !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: remove for decom
[15:10:38] <phuedx>	 hnowlan: I deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164388 at around that time. I don't see how removing a config for an inactive event stream would increase worker saturation
[15:11:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:11:16] <wikibugs>	 (PS1) Muehlenhoff: Remove ganeti2020 from Ganeti/codfw [puppet] - https://gerrit.wikimedia.org/r/1165042 (https://phabricator.wikimedia.org/T396590)
[15:11:22] <hnowlan>	 phuedx: yeah, seems relatively unlikely 
[15:12:03] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1004.eqiad.wmnet
[15:12:06] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Dumps_v1: Disable the sync job that publishes from dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1164150 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis)
[15:12:15] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system [puppet] - 10https://gerrit.wikimedia.org/r/1164157 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis)
[15:12:15] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2003.codfw.wmnet: Renew puppet certificate - jynus@cumin1003
[15:12:36] <phuedx>	 hnowlan: Same for the config change I deployed immediately after it – adding the ability for bureaucrats on frwiki to add/remove a group via the wgAdd- and wgRemoveGroup variables
[15:12:38] <phuedx>	 Seems unlikely
[15:12:42] <phuedx>	 But the timing is suspect
[15:13:41] <hnowlan>	 phuedx: seems there might be an external factor 
[15:15:27] <phuedx>	 hnowlan: Kinda agree. Worker saturation also seemed to increase during the backport window this morning (UTC)
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:52] <urandom>	 !bootstrapping Cassandra/sessionstore1004-a — T391544
[15:16:52] <stashbot>	 T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544
[15:19:34] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:50] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] LibericaEtcdErrors: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164957 (https://phabricator.wikimedia.org/T396320) (owner: 10Tiziano Fogli)
[15:23:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack designate: use 'designate' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott)
[15:24:05] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] PyBalBGPUnstable: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164959 (https://phabricator.wikimedia.org/T396321) (owner: 10Tiziano Fogli)
[15:24:31] <wikibugs>	 (03PS2) 10Eevans: sessionstore1005: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165015 (https://phabricator.wikimedia.org/T391544)
[15:24:31] <wikibugs>	 (03PS2) 10Eevans: sessionstore1005: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165016 (https://phabricator.wikimedia.org/T391544)
[15:24:31] <wikibugs>	 (03PS2) 10Eevans: sessionstore1006: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544)
[15:24:31] <wikibugs>	 (03PS2) 10Eevans: sessionstore1006: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165018 (https://phabricator.wikimedia.org/T391544)
[15:24:32] <wikibugs>	 (03PS2) 10Eevans: sessionstore: preseed eqiad servers for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1165019 (https://phabricator.wikimedia.org/T391544)
[15:24:39] <wikibugs>	 07sre-alert-triage, 06SRE Observability, 06Traffic, 13Patch-For-Review: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T396321#10959156 (10tappof) 05Open→03Resolved a:03tappof
[15:24:49] <wikibugs>	 07sre-alert-triage, 06SRE Observability, 06Traffic, 13Patch-For-Review: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T396320#10959159 (10tappof) 05Open→03Resolved a:03tappof
[15:27:27] <wikibugs>	 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10959192 (10Joe)
[15:28:02] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM, I'd prefer using epp templates (so it does type check, empty etc.) but I understand it's a bit more tedious." [puppet] - 10https://gerrit.wikimedia.org/r/1160104 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[15:29:16] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore1005: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165015 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[15:29:45] <wikibugs>	 (03PS2) 10Majavah: natlog: Use a separate journald namespace with no storage [puppet] - 10https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734)
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[15:30:05] <jouncebot>	 jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1530).
[15:30:18] <wikibugs>	 06SRE, 06Data-Engineering: Include accept-language header in turnilo/superset - https://phabricator.wikimedia.org/T398213 (10Joe) 03NEW
[15:37:14] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210)
[15:37:15] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212)
[15:37:36] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212)
[15:37:56] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210)
[15:37:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo)
[15:38:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo)
[15:38:16] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Follow up on lists.wm.o TLS usage - https://phabricator.wikimedia.org/T398018#10959306 (10LSobanski) @Vgutierrez who would be doing the work listed in the bullet points, you or us?
[15:38:45] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210)
[15:39:38] <wikibugs>	 (03PS4) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210)
[15:40:08] <wikibugs>	 (03PS1) 10Majavah: Remove root keys for former staff [labs/private] - 10https://gerrit.wikimedia.org/r/1165052
[15:40:12] <wikibugs>	 (03PS5) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210)
[15:40:54] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212)
[15:41:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo)
[15:41:37] <wikibugs>	 (03PS4) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212)
[15:42:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo)
[15:42:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10959333 (10cmooney)
[15:43:06] <wikibugs>	 (03PS5) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212)
[15:44:08] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:44:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netbox: Netbox script for adding secondary IPs - https://phabricator.wikimedia.org/T378730#10959364 (10cmooney) a:03cmooney @Eevans sorry this one escaped me somehow let me take a look, agreed it seems there is something wrong here.
[15:45:42] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:47:31] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts backup1002.eqiad.wmnet
[15:50:07] <logmsgbot>	 jhancock@cumin1003 provision (PID 3881940) is awaiting input
[15:52:35] <wikibugs>	 06SRE, 06cloud-services-team, 10DNS, 06Infrastructure-Foundations, and 2 others: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331#10959423 (10ayounsi) 05Open→03Declined Closing for now, please reopen if nee...
[15:53:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:32] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:53:47] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.dns.netbox
[15:56:58] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003"
[15:57:24] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003"
[15:57:24] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:57:25] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup1002.eqiad.wmnet
[15:59:14] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) (owner: 10Jcrespo)
[15:59:30] <jinxer-wm>	 FIRING: [2x] LibericaStaleConfig: Liberica instance lvs6002 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[16:00:01] <jynus>	 urandom: sessionstore1005: reimage for JBOD configuration (c95b94c7ee)
[16:00:18] <jynus>	 should I send it or wait?
[16:01:58] <vgutierrez>	 ^ liberica|pybal alerts are expected
[16:02:10] <jynus>	 I'm 99% sure this was something you were working on, urandom, but want to make sure the merging was intended
[16:04:30] <jinxer-wm>	 FIRING: [5x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[16:06:56] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6002.drmrs.wmnet,lvs5005.eqsin.wmnet,lvs3009.esams.wmnet,lvs7002.magru.wmnet,lvs4009.ulsfo.wmnet} and A:liberica (T394484)
[16:06:58] <urandom>	 jynus: oh I'm sorry
[16:07:02] <stashbot>	 T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484
[16:07:03] <urandom>	 yes, it can be merged
[16:07:10] <urandom>	 ...thought I had
[16:07:33] <taavi>	 !log manually update GadgetUsage on enwiki T397454
[16:07:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:39] <stashbot>	 T397454: Show active user stats on Special:GadgetUsage in English Wikipedia - https://phabricator.wikimedia.org/T397454
[16:07:48] <jynus>	 urandom: done
[16:07:57] <urandom>	 jynus: thanks!
[16:08:14] <logmsgbot>	 jhancock@cumin1003 reimage (PID 3873963) is awaiting input
[16:08:29] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:08:29] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6002.drmrs.wmnet,lvs5005.eqsin.wmnet,lvs3009.esams.wmnet,lvs7002.magru.wmnet,lvs4009.ulsfo.wmnet} and A:liberica (T394484)
[16:09:29] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:09:30] <jinxer-wm>	 FIRING: [6x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[16:09:55] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission backup1002 and its disk array - https://phabricator.wikimedia.org/T398210#10959586 (10jcrespo)
[16:12:14] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission backup1002 and its disk array - https://phabricator.wikimedia.org/T398210#10959604 (10jcrespo) @Jclark-ctr @VRiley-WMF It is my understanding that these arrays don't have a network interface to disable/DNS to handle, but ofc they will have to be ha...
[16:12:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:12:20] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10959605 (10DerHexer) I must have overlooked the question for confirmation, I'm sorry. I had tested it immediately and it's working well for me, thank you!
[16:13:31] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:13:57] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts backup2002.codfw.wmnet
[16:14:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6003.drmrs.wmnet,lvs5006.eqsin.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet,lvs4010.ulsfo.wmnet} and A:liberica (T394484)
[16:14:16] <stashbot>	 T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484
[16:14:22] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10959621 (10Clement_Goubert) 05Stalled→03Resolved Thanks for confirming!
[16:14:39] <urandom>	 !log decommissioning Cassandra/sessionstore1005-a — T391544
[16:14:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:45] <stashbot>	 T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544
[16:14:53] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm
[16:14:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10959628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest2005 (...
[16:15:41] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6003.drmrs.wmnet,lvs5006.eqsin.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet,lvs4010.ulsfo.wmnet} and A:liberica (T394484)
[16:17:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:18:29] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:19:29] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:19:30] <jinxer-wm>	 RESOLVED: [9x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[16:19:31] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.dns.netbox
[16:19:34] <jinxer-wm>	 FIRING: ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1005-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:47] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye
[16:19:57] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959648 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e...
[16:23:12] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003"
[16:23:32] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service sessionstore1004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:32] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003"
[16:23:33] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:23:33] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup2002.codfw.wmnet
[16:24:38] <wikibugs>	 (03PS1) 10Clare Ming: Enable experiment configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144)
[16:24:44] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo)
[16:26:46] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10959688 (10DLynch) a:03DLynch
[16:26:50] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2002 and its disk array - https://phabricator.wikimedia.org/T398212#10959689 (10jcrespo)
[16:26:55] <wikibugs>	 (03PS2) 10Clare Ming: Enable experiment configs fetching for group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144)
[16:28:09] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@3c90af1]: Synchronize artifact for airflow_dags/analytics_test
[16:28:22] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2002 and its disk array - https://phabricator.wikimedia.org/T398212#10959703 (10jcrespo) @Jhancock.wm It is my understanding that these arrays don't have a network interface to disable/DNS to handle, but ofc they will have to be handled physi...
[16:28:24] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@3c90af1]: Synchronize artifact for airflow_dags/analytics_test (duration: 00m 15s)
[16:28:54] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics@3c90af1]: Synchronize artifact for airflow_dags/analytics_test
[16:29:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: 10Clare Ming)
[16:29:33] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics@3c90af1]: Synchronize artifact for airflow_dags/analytics_test (duration: 00m 38s)
[16:31:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: civi2001 - https://phabricator.wikimedia.org/T397380#10959726 (10Jgreen) a:05Dwisehaupt→03None
[16:31:46] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] Enable experiment configs fetching for group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: 10Clare Ming)
[16:32:09] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1005.eqiad.wmnet with OS bullseye
[16:32:19] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959734 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad...
[16:32:43] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye
[16:32:52] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e...
[16:39:12] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm
[16:39:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10959771 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm
[16:40:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10959779 (10Jhancock.wm)
[16:45:02] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1005.eqiad.wmnet with OS bullseye
[16:45:16] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad...
[16:45:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet
[16:45:30] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye
[16:45:47] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959794 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e...
[16:49:34] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:49:46] <wikibugs>	 (03PS15) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316
[16:50:54] <wikibugs>	 (03CR) 10JHathaway: "@rcoccioli@wikimedia.org I think this is ready for a second pass, when you have the time, thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway)
[16:52:14] <logmsgbot>	 jhancock@cumin1003 provision (PID 3881940) is awaiting input
[16:53:32] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:54:26] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:54:31] <wikibugs>	 (03PS3) 10Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216)
[16:55:26] <wikibugs>	 10ops-eqiad, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225 (10Eevans) 03NEW
[16:55:33] <wikibugs>	 10ops-eqiad, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10959846 (10Eevans) p:05Triage→03Unbreak!
[16:56:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) (owner: 10Bernard Wang)
[16:56:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: 10Bernard Wang)
[16:56:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: 10Bernard Wang)
[16:57:35] <logmsgbot>	 !log tchin@deploy1003 Started deploy [airflow-dags/analytics@74e8d66]: Deploying artifacts for T388439
[16:57:42] <stashbot>	 T388439: Add metrics for monthly reconciles - https://phabricator.wikimedia.org/T388439
[16:58:11] <logmsgbot>	 !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@74e8d66]: Deploying artifacts for T388439 (duration: 00m 52s)
[17:00:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[17:00:05] <jouncebot>	 swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1700).
[17:00:05] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1700).
[17:00:15] <swfrench-wmf>	 o/
[17:00:41] <swfrench-wmf>	 the work I'd originally planned for this window will be deferred to a later date TBD
[17:02:37] <logmsgbot>	 jhancock@cumin1003 reimage (PID 3888301) is awaiting input
[17:03:25] <wikibugs>	 (03CR) 10Urbanecm: [C:04-1] "should be good to go once the privileged groups denotation is fixed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu)
[17:09:56] <wikibugs>	 (03CR) 10FNegri: [C:03+1] Remove root keys for former staff [labs/private] - 10https://gerrit.wikimedia.org/r/1165052 (owner: 10Majavah)
[17:15:04] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Replace analytics fake headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147912 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[17:21:57] <wikibugs>	 (03CR) 10Majavah: [V:03+2 C:03+2] Remove root keys for former staff [labs/private] - 10https://gerrit.wikimedia.org/r/1165052 (owner: 10Majavah)
[17:23:50] <wikibugs>	 (03CR) 10Michael Große: [C:04-1] "This should have had a -1, because a change is needed (see previous comment). But for the max-edit-limit, this is indeed what we want." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime)
[17:23:54] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:25:14] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1165065
[17:28:00] <logmsgbot>	 jhancock@cumin1003 provision (PID 3893852) is awaiting input
[17:33:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: Michael Große)
[17:34:28] <wikibugs>	 (Merged) jenkins-bot: Growth(enwiki): enable limiting Add a Link to new editors [mediawiki-config] - https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: Michael Große)
[17:34:44] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1164969|Growth(enwiki): enable limiting Add a Link to new editors (T386034)]]
[17:34:50] <stashbot>	 T386034: Add a Link: Community Configuration setting to allow limiting "Add a Link" to new editors  - https://phabricator.wikimedia.org/T386034
[17:35:35] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: Fix staging config path [deployment-charts] - https://gerrit.wikimedia.org/r/1165028 (owner: Jgiannelos)
[17:36:34] <wikibugs>	 (PS4) Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216)
[17:36:39] <logmsgbot>	 !log urbanecm@deploy1003 migr, urbanecm: Backport for [[gerrit:1164969|Growth(enwiki): enable limiting Add a Link to new editors (T386034)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:38:31] <MichaelG_WMF>	 urbanecm: so far, I'm not seeing any obvious problems with a user having NO add-a-link and only the legacy template-based links task
[17:38:40] <urbanecm>	 MichaelG_WMF: me neither
[17:39:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10960057 (phaultfinder)
[17:40:16] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Use profiler script to spawn profiler [deployment-charts] - https://gerrit.wikimedia.org/r/1165068
[17:40:39] <wikibugs>	 SRE-SLO, Observability-Metrics, SRE Observability (FY2024/2025-Q4): liftwing SLO performance issues - https://phabricator.wikimedia.org/T387350#10960064 (herron) Open→Resolved Optimistically resolving as we've tuned the window for istio slos to 4w (from 12w)
[17:40:52] <MichaelG_WMF>	 urbanecm: Ok, then I'd say let's move forward?
[17:40:57] <urbanecm>	 sure!
[17:40:59] <logmsgbot>	 !log urbanecm@deploy1003 migr, urbanecm: Continuing with sync
[17:41:45] <wikibugs>	 (PS2) Jgiannelos: mobileapps: Use profiler script to spawn profiler [deployment-charts] - https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750)
[17:42:25] <wikibugs>	 (CR) CI reject: [V:-1] mobileapps: Use profiler script to spawn profiler [deployment-charts] - https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) (owner: Jgiannelos)
[17:42:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:45:20] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:45:53] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:46:30] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164969|Growth(enwiki): enable limiting Add a Link to new editors (T386034)]] (duration: 11m 46s)
[17:46:34] <urbanecm>	 MichaelG_WMF: should be deployed
[17:46:36] <stashbot>	 T386034: Add a Link: Community Configuration setting to allow limiting "Add a Link" to new editors  - https://phabricator.wikimedia.org/T386034
[17:47:10] <MichaelG_WMF>	 urbanecm: thanks, I'll check without mwdebug!
[17:48:09] <urbanecm>	 sounds good!
[17:48:55] <MichaelG_WMF>	 Looks good, as far as I can tell
[17:48:57] <logmsgbot>	 jhancock@cumin1003 provision (PID 3894577) is awaiting input
[17:49:18] <MichaelG_WMF>	 let's see how things develop and also with the upcoming train
[17:51:45] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox
[17:52:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10960126 (RobH) @Jclark-ctr & @Jhancock.wm: Please note this was pinged in IRC as well, if either of you are on-site today/next, please address this issue.
[17:52:36] <urbanecm>	 MichaelG_WMF: i certainly hope train won't break :)
[17:52:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:53:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10960129 (Jhancock.wm) it's a backplane communication error. someone onsite needs to reseat the cables to the backplane. 90% chance that fixes it.
[17:53:59] <MichaelG_WMF>	 urbanecm: I don't expect it to. We tested this in beta and all we found were minor UI issues, nothing that would break a train.
[17:54:34] <MichaelG_WMF>	 Still, next time it would be nice to have it enabled in testwiki early.
[17:55:33] <logmsgbot>	 jhancock@cumin1003 provision (PID 3894577) is awaiting input
[17:55:47] <brett>	 !log Implement Varnish vmod_var-based X-Analytics formatting - T373550
[17:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:53] <stashbot>	 T373550: Move varnish pseudo-headers to vmod_var variables - https://phabricator.wikimedia.org/T373550
[17:58:58] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10960145 (BCornwall) Open→In progress p:Triage→Medium a:RobH→BCornwall
[18:00:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:04:06] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10960165 (Jhancock.wm) @jcrespo hey i finished testing on this server. Do you want to take it for a spin? it's the new 1CPU Config-K. (note, the re-image is going to come back as faile...
[18:04:43] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10960181 (Jhancock.wm) Open→Resolved
[18:05:41] <logmsgbot>	 !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore1005.eqiad.wmnet with OS bullseye
[18:05:53] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10960191 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad...
[18:05:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:09:31] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:12:13] <wikibugs>	 (PS4) BCornwall: varnish: Implement translation analytics vars [puppet] - https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550)
[18:15:04] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:17:54] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:18:19] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:18:56] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:19:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[18:19:46] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:19:57] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:23:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2010 to codfw - jhancock@cumin2002"
[18:23:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2010 to codfw - jhancock@cumin2002"
[18:23:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:25:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2010
[18:27:47] <wikibugs>	 (PS1) C. Scott Ananian: Disable ParserMigration indicator and user notice [mediawiki-config] - https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484)
[18:28:21] <wikibugs>	 (PS5) BCornwall: varnish: Implement translation analytics vars [puppet] - https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550)
[18:28:21] <logmsgbot>	 jhancock@cumin2002 configure-switch-interfaces (PID 3476591) is awaiting input
[18:28:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:28:38] <wikibugs>	 (CR) Daniuu: "Removed the deleted permissions for now. If needed, we can add them again in a later patch." [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: Daniuu)
[18:30:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2010
[18:31:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:32:15] <wikibugs>	 (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: Daniuu)
[18:32:23] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:33:09] <wikibugs>	 (CR) CI reject: [V:-1] nlwiki: add VRT agent user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: Daniuu)
[18:33:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:34:14] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:35:03] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm
[18:35:11] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10960265 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest20...
[18:36:53] <wikibugs>	 (PS5) Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216)
[18:38:19] <wikibugs>	 SRE-SLO, EditCheck, Editing-team (Kanban Board), Essential-Work, MW-1.45-notes (1.45.0-wmf.8; 2025-07-01): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10960272 (DLynch) @elukey Okay, this has made it to the train for this week, so we s...
[18:38:30] <logmsgbot>	 jhancock@cumin2002 provision (PID 3479969) is awaiting input
[18:40:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:41:10] <wikibugs>	 (PS6) Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216)
[18:41:30] <wikibugs>	 (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: Daniuu)
[18:41:59] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.discovery.service-route check sessionstore: maintenance
[18:41:59] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check sessionstore: maintenance
[18:42:08] <wikibugs>	 (CR) BCornwall: [V:+2] "They are (after fixing a bad rebase):" [puppet] - https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: BCornwall)
[18:42:09] <logmsgbot>	 jhancock@cumin1003 provision (PID 3897463) is awaiting input
[18:43:26] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.discovery.service-route check sessionstore: maintenance
[18:43:26] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check sessionstore: maintenance
[18:44:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[18:46:29] <wikibugs>	 (CR) Urbanecm: [C:+1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: Daniuu)
[18:46:34] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:47:33] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc2002.codfw.wmnet with reason: Decom
[18:47:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:47:46] <wikibugs>	 (CR) AOkoth: [C:+2] doc: decom doc2002 [puppet] - https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: AOkoth)
[18:48:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: Daniuu)
[18:50:06] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:50:28] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:50:56] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:51:10] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.decommission for hosts doc2002.codfw.wmnet
[18:51:18] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:55:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10960323 (Eevans) >>! In T398225#10960125, @RobH wrote: > @Jclark-ctr & @Jhancock.wm: Please note this was pinged in IRC as well, if either of you are on-site today/next, please address this issue.  Shou...
[18:55:42] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Use GET instead of POST for MW API requests [deployment-charts] - https://gerrit.wikimedia.org/r/1165110
[18:55:53] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.dns.netbox
[18:55:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:59:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service doc1004.eqiad.wmnet:443 has failed probes (http_doc1004_eqiad_wmnet_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:59:24] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002"
[19:00:33] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002"
[19:00:33] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:00:33] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doc2002.codfw.wmnet
[19:02:02] <wikibugs>	 (PS2) Jgiannelos: mobileapps: Use GET instead of POST for MW API requests [deployment-charts] - https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167)
[19:02:51] <wikibugs>	 (CR) Urbanecm: [C:-1] mobileapps: Use GET instead of POST for MW API requests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) (owner: Jgiannelos)
[19:03:39] <wikibugs>	 (CR) Urbanecm: [C:-1] "curiosity question: how does this solve the issue? I'd like to understand the root cause here, if possible." [deployment-charts] - https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) (owner: Jgiannelos)
[19:05:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:19:17] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service doc1004.eqiad.wmnet:443 has failed probes (http_doc1004_eqiad_wmnet_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:28:45] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[19:45:42] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:53:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:53:32] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[20:00:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T2000).
[20:00:05] <jouncebot>	 EggRoll97, tgr, MichaelG_WMF, cjming, and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[20:00:19] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960707 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm
[20:00:26] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[20:00:31] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960710 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (...
[20:01:16] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[20:01:25] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm
[20:01:37] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[20:01:40] <cjming>	 o/
[20:01:41] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (...
[20:01:49] <tgr>	 o/
[20:01:53] <cjming>	 i can deploy for those not self-deploying
[20:02:21] <bwang>	 hi
[20:02:28] <tgr>	 my patch is safe to bundle with other stuff
[20:02:30] <bwang>	 Yes I have 2 things but I'm not self deploying
[20:03:47] <cjming>	 EggRoll97: are you here?
[20:04:23] <cjming>	 tgr: maybe I'll do yours and mine together?
[20:04:40] <cjming>	 MichaelG_WMF: how about you?
[20:04:51] <cjming>	 bwang: can all 3 of your go out together?
[20:04:57] <cjming>	 *yours
[20:05:21] <bwang>	 It's only 2
[20:05:23] <bwang>	 But yes
[20:05:52] <cjming>	 cool - ok i'll start with mine and tgr's
[20:06:19] <wikibugs>	 (PS2) Gergő Tisza: Revert "Add scrambled: password class" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360)
[20:06:44] <wikibugs>	 (PS3) Clare Ming: Enable experiment configs fetching for group 0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144)
[20:07:18] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:08:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: Gergő Tisza)
[20:08:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: Clare Ming)
[20:09:05] <wikibugs>	 (Merged) jenkins-bot: Revert "Add scrambled: password class" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: Gergő Tisza)
[20:09:12] <wikibugs>	 (Merged) jenkins-bot: Enable experiment configs fetching for group 0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: Clare Ming)
[20:09:30] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1159568|Revert "Add scrambled: password class" (T395360 T395372)]], [[gerrit:1165060|Enable experiment configs fetching for group 0 (T397144)]]
[20:09:38] <stashbot>	 T395372: Handle scrambled password type in CentralAuth - https://phabricator.wikimedia.org/T395372
[20:09:38] <stashbot>	 T397144: MetricsPlatform: Enable experiment config fetching - https://phabricator.wikimedia.org/T397144
[20:11:27] <logmsgbot>	 !log cjming@deploy1003 cjming, tgr: Backport for [[gerrit:1159568|Revert "Add scrambled: password class" (T395360 T395372)]], [[gerrit:1165060|Enable experiment configs fetching for group 0 (T397144)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:11:59] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1165150
[20:12:41] <tgr>	 cjming: mine works
[20:12:54] <tgr>	 oh wait, forgot to use XWD
[20:13:31] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:13:35] <tgr>	 okay still works
[20:13:40] <cjming>	 nice - will sync
[20:13:53] <logmsgbot>	 !log cjming@deploy1003 cjming, tgr: Continuing with sync
[20:14:17] <cjming>	 bwang: i'll do both of yours next
[20:14:48] <bwang>	 Thank you
[20:19:27] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159568|Revert "Add scrambled: password class" (T395360 T395372)]], [[gerrit:1165060|Enable experiment configs fetching for group 0 (T397144)]] (duration: 09m 57s)
[20:19:35] <stashbot>	 T395372: Handle scrambled password type in CentralAuth - https://phabricator.wikimedia.org/T395372
[20:19:35] <stashbot>	 T397144: MetricsPlatform: Enable experiment config fetching - https://phabricator.wikimedia.org/T397144
[20:20:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) (owner: Bernard Wang)
[20:20:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: Bernard Wang)
[20:21:17] <wikibugs>	 (Merged) jenkins-bot: Prevent extra scrolling when dialog is open on ios [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) (owner: Bernard Wang)
[20:26:04] <bwang>	 Lmk when it's ready to test!
[20:26:39] <cjming>	 will do - just waiting for your core backport to merge
[20:27:04] <wikibugs>	 (PS1) Andrew Bogott: Openstack web proxy: allow 'proxyadmin' users to modify proxies [puppet] - https://gerrit.wikimedia.org/r/1165154 (https://phabricator.wikimedia.org/T273150)
[20:27:05] <wikibugs>	 (PS1) Andrew Bogott: Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config [puppet] - https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150)
[20:28:44] <wikibugs>	 (PS1) Scott French: aptrepo: add php83 component and pcre2 updates [puppet] - https://gerrit.wikimedia.org/r/1165151 (https://phabricator.wikimedia.org/T398245)
[20:28:44] <wikibugs>	 (CR) Scott French: "Thanks in advance for the review!" [puppet] - https://gerrit.wikimedia.org/r/1165151 (https://phabricator.wikimedia.org/T398245) (owner: Scott French)
[20:28:45] <wikibugs>	 (PS2) Scott French: package_builder: add pbuilder hook for component/php83 [puppet] - https://gerrit.wikimedia.org/r/1165152 (https://phabricator.wikimedia.org/T398245)
[20:31:01] <wikibugs>	 (CR) Andrew Bogott: [C:+1] P:wmcs: ntp: Automatically restart the service after config changes [puppet] - https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: Majavah)
[20:32:28] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960902 (Jhancock.wm)
[20:33:16] <wikibugs>	 (Merged) jenkins-bot: Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened [core] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: Bernard Wang)
[20:33:27] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:33:34] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1164474|Prevent extra scrolling when dialog is open on ios (T397539)]], [[gerrit:1164475|Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened (T397469)]]
[20:33:41] <stashbot>	 T397539: Fix background scrolling on new mobile search experience - https://phabricator.wikimedia.org/T397539
[20:33:41] <stashbot>	 T397469: Remove extra tap when opening search bar on minerva - https://phabricator.wikimedia.org/T397469
[20:35:28] <logmsgbot>	 !log cjming@deploy1003 cjming, bwang: Backport for [[gerrit:1164474|Prevent extra scrolling when dialog is open on ios (T397539)]], [[gerrit:1164475|Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened (T397469)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:35:34] <cjming>	 bwang: on test servers if you want to check ^^
[20:36:18] <cjming>	 lmk if/when to sync
[20:38:44] <bwang>	 Ok done!
[20:38:51] <bwang>	 Go ahead
[20:38:51] <cjming>	 cool ! ok to sync?
[20:38:54] <cjming>	 nice
[20:39:02] <logmsgbot>	 !log cjming@deploy1003 cjming, bwang: Continuing with sync
[20:41:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:41:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:42:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:42:51] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:44:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:44:27] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[20:44:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960932 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm
[20:44:44] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:44:57] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164474|Prevent extra scrolling when dialog is open on ios (T397539)]], [[gerrit:1164475|Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened (T397469)]] (duration: 11m 23s)
[20:45:02] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[20:45:04] <stashbot>	 T397539: Fix background scrolling on new mobile search experience - https://phabricator.wikimedia.org/T397539
[20:45:04] <stashbot>	 T397469: Remove extra tap when opening search bar on minerva - https://phabricator.wikimedia.org/T397469
[20:45:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (...
[20:46:01] <cjming>	 EggRoll97 + MichaelG_WMF: if you're around and want to self-deploy, please go ahead -- if you need a deployer, please ping me and I can deploy for you
[20:46:31] <cjming>	 I'll leave the backport window open for a few minutes longer
[20:48:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[20:49:04] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2006.codfw.wmnet with OS bookworm
[20:49:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[20:49:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (...
[21:00:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T2100).
[21:05:31] <wikibugs>	 (03PS1) 10Eevans: fix cookbook names in example text [cookbooks] - 10https://gerrit.wikimedia.org/r/1165161
[21:19:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10961058 (10phaultfinder)
[21:23:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10961061 (10Jhancock.wm) @elukey trying to figure out why this reimage script isn't working on this test server. it has a raid and a boss card. the boss card has a raid 1 between the two...
[21:24:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:29:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:36:24] <wikibugs>	 (03CR) 10Scott French: [C:03+1] fix cookbook names in example text [cookbooks] - 10https://gerrit.wikimedia.org/r/1165161 (owner: 10Eevans)
[21:37:47] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Ensure that master=yarn is the default spark configuration for users [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis)
[21:39:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:44:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:47:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:52:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:54:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: ...
[21:54:45] <jinxer-wm>	 cirrus-streaming-updater job  in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask
[21:59:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkNoRegisteredTask: ...
[21:59:45] <jinxer-wm>	 cirrus-streaming-updater job  in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask
[22:03:03] <wikibugs>	 (03CR) 10Ladsgroup: "Thank you and sorry for such a dumb mistake" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 (owner: 10Gmodena)
[22:05:23] <wikibugs>	 (03PS1) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169
[22:06:21] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10961146 (10Htriedman) Hi @Clement_Goubert! When I navigate to the L3 document page, there's no option to sign again — any way I...
[22:15:28] <wikibugs>	 (03PS2) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169
[22:16:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup)
[22:18:40] <wikibugs>	 (03PS3) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169
[22:21:09] <wikibugs>	 (03PS4) 10Cwhite: logstash: filter_on_template_v2 fixes [puppet] - 10https://gerrit.wikimedia.org/r/1163486 (https://phabricator.wikimedia.org/T234565)
[22:21:38] <wikibugs>	 (03PS3) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:24:53] <wikibugs>	 (03PS4) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:26:23] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: filter_on_template_v2 fixes [puppet] - 10https://gerrit.wikimedia.org/r/1163486 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[22:28:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:38:43] <wikibugs>	 (03PS1) 10Cwhite: logstash: re-enable filter_on_template_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1165173
[22:40:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:44:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:45:42] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: re-enable filter_on_template_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1165173 (owner: 10Cwhite)
[22:46:14] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 331 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:47:14] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29767 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:47:42] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:57:28] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:59:20] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[22:59:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10961285 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host sretest2006.codfw.wmnet with OS bookworm
[22:59:39] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[22:59:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10961286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (**...
[23:00:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430)
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T2300)
[23:02:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:09:59] <wikibugs>	 (03PS1) 10Tim Starling: uppercaseTitlesForUnicodeTransition: Add file table [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165179 (https://phabricator.wikimedia.org/T383496)
[23:15:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:17:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:29:25] <TimStarling>	 "web team does not typically check IRC so assume this is not being used if 5 minutes past the start"
[23:29:29] <TimStarling>	 classy
[23:29:51] <wikibugs>	 (03CR) 10Tim Starling: [C:03+2] uppercaseTitlesForUnicodeTransition: Add file table [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165179 (https://phabricator.wikimedia.org/T383496) (owner: 10Tim Starling)
[23:30:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:33:44] <wikibugs>	 (03Merged) 10jenkins-bot: uppercaseTitlesForUnicodeTransition: Add file table [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165179 (https://phabricator.wikimedia.org/T383496) (owner: 10Tim Starling)
[23:34:42] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1165179|uppercaseTitlesForUnicodeTransition: Add file table (T383496)]]
[23:34:47] <stashbot>	 T383496: Add support for reading new file schema into MediaWiki - https://phabricator.wikimedia.org/T383496
[23:36:41] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1165179|uppercaseTitlesForUnicodeTransition: Add file table (T383496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:38:10] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1165180
[23:38:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1165180 (owner: 10TrainBranchBot)
[23:38:47] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Continuing with sync
[23:42:59] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10961353 (10Ladsgroup) ` root@ms-fe1009:~# swift stat --lh wikipedia-commons-local-thumb.13                Account: AUTH_mw              Container: wikipedia-commons-local-thumb.13...
[23:43:47] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v0.7.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165181 (https://phabricator.wikimedia.org/T396151)
[23:44:16] <logmsgbot>	 !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165179|uppercaseTitlesForUnicodeTransition: Add file table (T383496)]] (duration: 09m 34s)
[23:44:22] <stashbot>	 T383496: Add support for reading new file schema into MediaWiki - https://phabricator.wikimedia.org/T383496
[23:45:20] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165181 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming)
[23:45:42] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:47:00] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.7.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165181 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming)
[23:47:27] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v0.7.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165182 (https://phabricator.wikimedia.org/T396151)
[23:48:32] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165182 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming)
[23:49:53] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[23:50:07] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.7.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165182 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming)
[23:50:37] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1165180 (owner: 10TrainBranchBot)
[23:50:52] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[23:53:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:53:36] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[23:59:54] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply