[00:08:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164702
[00:08:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164702 (owner: TrainBranchBot)
[00:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:30:22] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1164702 (owner: TrainBranchBot)
[00:46:19] (PS1) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1164703
[00:46:28] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[01:12:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:22:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:28:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:42:42] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:27] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:58:55] (CR) Kosta Harlan: [C:+1] "Removing the -2 per T397940#10956874" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[05:59:30] (PS8) Tchanders: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940)
[06:00:04] (PS9) Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[06:01:29] (PS10) Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[06:04:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: Tchanders)
[06:04:36] jouncebot: nowandnext
[06:04:36] For the next 0 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250629T0700)
[06:04:36] In 0 hour(s) and 55 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T0700)
[06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:16:50] (PS1) Slyngshede: data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707
[06:18:11] (CR) Muehlenhoff: data.yaml offboarding trokhymovych (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[06:18:25] (CR) CI reject: [V:-1] data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[06:20:04] (PS2) Slyngshede: data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707
[06:21:20] (CR) Muehlenhoff: [C:-1] "Pending extension, see the mail thread with Diego" [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[06:26:34] (PS1) Muehlenhoff: Add ganeti role to ganeti2049 [puppet] - https://gerrit.wikimedia.org/r/1164708 (https://phabricator.wikimedia.org/T396590)
[06:29:24] (CR) Muehlenhoff: [C:+2] Add ganeti role to ganeti2049 [puppet] - https://gerrit.wikimedia.org/r/1164708 (https://phabricator.wikimedia.org/T396590) (owner: Muehlenhoff)
[06:31:37] (PS1) Slyngshede: data.yaml offboarding mnz [puppet] - https://gerrit.wikimedia.org/r/1164709
[06:32:35] !log push pfw policies - T397875
[06:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:10] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1164709 (owner: Slyngshede)
[06:35:36] (CR) Slyngshede: [C:+2] data.yaml offboarding mnz [puppet] - https://gerrit.wikimedia.org/r/1164709 (owner: Slyngshede)
[06:37:25] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet
[06:38:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:02] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muniza out of all services on: 2400 hosts
[06:40:22] (PS3) Ayounsi: Redfish: more tests [software/spicerack] - https://gerrit.wikimedia.org/r/1164473
[06:40:58] SRE, Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#10957224 (Joe) >>! In T396562#10907844, @JAllemandou wrote: > I think this would be feasible as most of frontend data is already available in th...
[06:45:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet
[06:47:10] (PS1) Volans: JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696)
[06:47:11] (PS1) Volans: JS/CSS: update DataTables library [software/debmonitor] - https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696)
[06:51:21] (PS1) Muehlenhoff: Update records for dhardy [puppet] - https://gerrit.wikimedia.org/r/1164861
[06:56:11] (PS1) Filippo Giunchedi: prometheus: split pushgateway logs [puppet] - https://gerrit.wikimedia.org/r/1164862 (https://phabricator.wikimedia.org/T398091)
[06:56:37] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2049.codfw.wmnet to cluster codfw and group B
[06:58:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2049.codfw.wmnet to cluster codfw and group B
[07:00:04] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T0700).
[07:00:04] koi, DreamRimmer, and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:28] o/
[07:04:25] o/
[07:04:31] hi, I'm here
[07:04:57] for deployment?
[07:05:15] it seems like no other deployers are around, so, yes, I could do that
[07:05:25] I'll have a look at the patches
[07:08:01] (CR) Tiziano Fogli: [C:+1] prometheus: split pushgateway logs [puppet] - https://gerrit.wikimedia.org/r/1164862 (https://phabricator.wikimedia.org/T398091) (owner: Filippo Giunchedi)
[07:08:26] I do not have anything to test these patches, but I hope you can proceed based on the +1 from other devs. I am backporting these at the request of the original uploader, as they are not currently available
[07:08:45] Look good to me too
[07:10:26] my patch is simple enough and i can test it quickly
[07:12:32] I'm not totally sure about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164506, and would prefer that the author or reviewer are around to verify those changes
[07:13:33] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163081 (https://phabricator.wikimedia.org/T397676) (owner: Stang)
[07:13:35] (CR) Filippo Giunchedi: [C:+2] prometheus: split pushgateway logs [puppet] - https://gerrit.wikimedia.org/r/1164862 (https://phabricator.wikimedia.org/T398091) (owner: Filippo Giunchedi)
[07:14:26] (Merged) jenkins-bot: zhwiki: Remove autopatrol from patroller group [mediawiki-config] - https://gerrit.wikimedia.org/r/1163081 (https://phabricator.wikimedia.org/T397676) (owner: Stang)
[07:14:41] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957279 (MoritzMuehlenhoff)
[07:14:54] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163081|zhwiki: Remove autopatrol from patroller group (T397676)]]
[07:15:00] T397676: Remove autopatrol from patroller group on zhwiki - https://phabricator.wikimedia.org/T397676
[07:15:20] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[07:15:47] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957283 (ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[07:15:50] kostajh: patch author here. what's up?
[07:16:10] hi NovemLinguae
[07:16:40] NovemLinguae: would you be able to verify https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164506 after deployment? Is there any risk of breakage when updating this config?
[07:17:59] the patch that introduces that config var, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1090512, hasn't ridden the train yet, was merged on friday. might make it hard to test
[07:18:49] if we want thorough testing, i guess we could backport that one too? or delay this backport a couple days?
[07:19:42] (CR) Kosta Harlan: initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[07:20:25] NovemLinguae: what changes would we expect to see once the train reaches group2, and your config patch is enabled?
[07:22:02] !log bounce prometheus-pushgateway on prometheus1005 - T398091
[07:22:04] SecurePoll local elections would start writing JSON to subpages of the page MediaWiki:SecurePoll, as a type of logging when certain poll settings pages are edited (Special:SecurePoll/create, Special:SecurePoll/edit, Special:SecurePoll/translate, Special:SecurePoll/votereligibility). This is identical to the behavior of an existing setting, with the only difference being the target page it writes to
[07:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:08] T398091: Prometheus1005 out of disk on / - https://phabricator.wikimedia.org/T398091
[07:22:57] MediaWiki:SecurePoll and its subpages are configured as a read only namespace. The idea of doing this kind of logging is to provide a useful history tab to see who modified what settings when.
[07:23:32] (CR) Slyngshede: [C:+1] "LGTM, email is not active yet, but will be soon." [puppet] - https://gerrit.wikimedia.org/r/1164861 (owner: Muehlenhoff)
[07:25:25] Re your code review comment, wmgSecurePollUseNamespace defaults to false. Can add enwiki = false if you want though. Up to you.
[07:25:34] (CR) Muehlenhoff: [C:+2] Update records for dhardy [puppet] - https://gerrit.wikimedia.org/r/1164861 (owner: Muehlenhoff)
[07:26:28] NovemLinguae: ack. So for now, nothing would change on enwiki today - it would just impact new polls that are created in the future?
[07:27:10] Correct. Starting on Thursday when the new setting rides the train, then CREATING or EDITING a LOCAL poll will start writing stuff to MediaWiki:SecurePoll/* subpages.
[07:27:24] (unless we backport that patch right now)
[07:28:47] (CR) Muehlenhoff: [C:+1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[07:30:06] ok
[07:30:09] seems fine, then
[07:30:32] :)
[07:33:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[07:33:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:17] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[07:35:47] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957308 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to drbd
[07:37:39] still waiting on images to get built for the first patch
[07:37:42] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:39:34] NovemLinguae: so, to confirm, there's nothing for you to verify when I sync your patches, right?
[07:39:48] other than that enwiki still loads :)
[07:40:06] (CR) Jelto: [V:+1 C:+2] gitlab: disable nftables prometheus exporter script in wmcs [puppet] - https://gerrit.wikimedia.org/r/1164389 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[07:41:41] !log restart prometheus-pushgateway on prometheus1005 with fresh state - T398091
[07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:48] T398091: Prometheus1005 out of disk on / - https://phabricator.wikimedia.org/T398091
[07:42:03] correct. I don't think I can test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164507 either since it doesn't affect enwiki. someone with the proper election administrator user group on those wikis could go edit a test poll, then check their contribs to make sure an edit was made
[07:42:16] (testwiki, votewiki)
[07:42:49] if a diff link is given i'd be happy to analyze it and confirm that it's working
[07:43:16] I might still have rights if we want to check? Let me confirm
[07:43:51] Yes I'm still an election admin on votewiki if we want to QA the change.
[07:44:09] Tran: cool, thanks. Will let you know when the changes are staged
[07:44:13] !log kharlan@deploy1003 stang, kharlan: Backport for [[gerrit:1163081|zhwiki: Remove autopatrol from patroller group (T397676)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:44:20] T397676: Remove autopatrol from patroller group on zhwiki - https://phabricator.wikimedia.org/T397676
[07:44:24] koi: your change is available for testing now
[07:44:31] lookin
[07:45:01] kostajh: LGTM
[07:45:08] !log kharlan@deploy1003 stang, kharlan: Continuing with sync
[07:45:10] koi: thanks
[07:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:48:30] !log installing krb5 security updates
[07:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:15] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org
[07:52:24] (PS10) Vgutierrez: cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[07:52:28] SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10957327 (Aklapper)
[07:54:03] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[07:56:23] PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:56:42] uh... expected?
[07:57:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org
[07:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:57:30] (PS1) Jelto: gitlab: fix typo in hiera config [puppet] - https://gerrit.wikimedia.org/r/1164944 (https://phabricator.wikimedia.org/T396622)
[07:58:20] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163081|zhwiki: Remove autopatrol from patroller group (T397676)]] (duration: 43m 26s)
[07:58:26] T397676: Remove autopatrol from patroller group on zhwiki - https://phabricator.wikimedia.org/T397676
[07:59:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[08:00:02] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6104/console" [puppet] - https://gerrit.wikimedia.org/r/1164944 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[08:00:15] RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms
[08:00:16] vgutierrez: yeah, I had a switch the disk type away from local disk storage, since the underlying Ganeti node is being decommed
[08:00:26] and that needs a reboot to effect the change
[08:00:27] ack, thx
[08:01:02] ok, on to the securepoll patches
[08:01:08] that ook a long time!
[08:01:13] *took
[08:01:36] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[08:01:36] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164507 (owner: Novem Linguae)
[08:01:50] (CR) Jelto: [V:+1 C:+2] gitlab: fix typo in hiera config [puppet] - https://gerrit.wikimedia.org/r/1164944 (https://phabricator.wikimedia.org/T396622) (owner: Jelto)
[08:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:02:28] (Merged) jenkins-bot: initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1164506 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[08:02:31] (Merged) jenkins-bot: refactor unnecessary wmgSecurePollUseNamespace variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1164507 (owner: Novem Linguae)
[08:02:47] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1164506|initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (T398080)]], [[gerrit:1164507|refactor unnecessary wmgSecurePollUseNamespace variable]]
[08:02:53] T398080: Set $wgSecurePollUseMediaWikiNamespace = true on English Wikipedia - https://phabricator.wikimedia.org/T398080
[08:04:29] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[08:04:46] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957357 (ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[08:07:03] !log kharlan@deploy1003 novemlinguae, kharlan: Backport for [[gerrit:1164506|initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (T398080)]], [[gerrit:1164507|refactor unnecessary wmgSecurePollUseNamespace variable]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:07:35] for testing, can visit Special:SecurePoll and click around and make sure no obvious PHP errors, etc.
[08:07:49] Tran: ok, ready for review now
[08:07:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[08:08:18] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[08:08:46] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957365 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to plain
[08:08:55] (CR) Giuseppe Lavagetto: [C:+1] cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: Fabfur)
[08:08:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[08:10:00] I clicked around Special:SecurePoll and Special:SecurePoll/translate, no errors. (Not that I'd expect any for setting an undeployed config option, but just to be sure.)
[08:10:16] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[08:10:30] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957366 (ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs
[08:10:56] Triggered a few intentional errors on the create page, changed voter eligibility, ran a tally - no errors, saw the voter eligibility change logged
[08:11:01] ok, cool
[08:11:03] thanks!
[08:11:32] !log kharlan@deploy1003 novemlinguae, kharlan: Continuing with sync
[08:12:29] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[08:13:24] (CR) Muehlenhoff: [C:+1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[08:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:13:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:41] SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10957372 (MoritzMuehlenhoff)
[08:13:50] <_joe_> elukey: is the docker-reporter thing you?
[08:14:26] Does anyone know if it's possible to remove someone else's -2 on a patch? (this patch in particular https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1163738)
[08:14:49] _joe_ no I think Moritz was working on it, but it must be another occurrence of the v2 vs v1 api, need to add it to the exclude filter
[08:16:28] (PS1) Majavah: openstack: puppet-enc: Return helpful error for invalid role data [puppet] - https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117)
[08:16:54] _joe_: we're looking into it, seems introduced by https://phabricator.wikimedia.org/T395826
[08:17:12] <_joe_> ^_^
[08:17:13] kostajh: hover over the code-review item over "Submit Requirements" in the left sidebar, and there is a trash can icon to delete a vote
[08:17:32] <_joe_> well *WMCS* doesn't really matter here, so we need to exclude that cluster I guess?
[08:19:02] taavi: thanks, but I am not seeing "Submit requirements"
[08:19:05] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164506|initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki (T398080)]], [[gerrit:1164507|refactor unnecessary wmgSecurePollUseNamespace variable]] (duration: 16m 17s)
[08:19:10] T398080: Set $wgSecurePollUseMediaWikiNamespace = true on English Wikipedia - https://phabricator.wikimedia.org/T398080
[08:19:59] kostajh: i see it. screenshot: https://imgur.com/a/WQqD07j
[08:20:05] (PS1) Elukey: docker-report: fix k8s filter [puppet] - https://gerrit.wikimedia.org/r/1164946
[08:20:28] kostajh: hmm, this is what the left sidebar looks for me: https://prod-misc-upload.public.object.majava.org/taavi/0JySJ1nI1Stut.png do you not see that bottom bit?
[08:20:28] I think it's because in this case, the -2 is from the owner [08:20:29] (03CR) 10Vgutierrez: [C:03+2] cache,haproxy: refactor haproxy captures to fix x-analytics logging [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [08:20:42] aha [08:20:43] thanks [08:21:01] _joe_ we have a proposal to make debmonitor checking only images running on k8s clusters, it should make things easier. It doesn't make sense to track everything like we do now.. [08:21:23] (03PS1) 10Volans: kubernetes: skip missing/failing images on update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) [08:22:06] (03CR) 10Ayounsi: [C:03+1] "nice ! LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [08:22:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [08:22:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164946 (owner: 10Elukey) [08:22:49] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [08:23:08] (03Merged) 10jenkins-bot: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [08:23:13] (03CR) 10Elukey: [C:03+2] docker-report: fix k8s filter [puppet] - 10https://gerrit.wikimedia.org/r/1164946 (owner: 10Elukey) [08:23:21] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163738|temp accounts: Enable temp account creation on further wikis (T397940)]] [08:23:27] T397940: Batch 3 deployment of Temp Accounts Major pilots - 
https://phabricator.wikimedia.org/T397940 [08:24:13] (03CR) 10Ayounsi: [C:03+1] "You can also update the doc once it's merged https://wikitech.wikimedia.org/wiki/Anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [08:25:11] kostajh: thanks for doing backports this morning. is nice to see folks helping in the UTC morning backport window, which is sometimes a bit quiet [08:25:12] (03CR) 10Ayounsi: [C:03+1] "with one nit" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [08:25:16] !log kharlan@deploy1003 kharlan, tchanders: Backport for [[gerrit:1163738|temp accounts: Enable temp account creation on further wikis (T397940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:25:17] appreciate it! [08:25:35] (03CR) 10Elukey: [C:03+1] JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:25:46] NovemLinguae: sure thing, thanks for the discussion about your patches, and for your work on SecurePoll! [08:25:59] yw :) [08:26:33] (03PS1) 10KartikMistry: Remove cxstats campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) [08:26:36] (03CR) 10Elukey: [C:03+1] JS/CSS: update DataTables library [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:26:58] !log kharlan@deploy1003 kharlan, tchanders: Continuing with sync [08:29:28] (03CR) 10Elukey: [C:03+1] "Very nice I like it!" 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:30:49] <_joe_> elukey: I agree [08:31:31] _joe_ we have some thing almost ready after the hackathon, more info to follow soon :) [08:31:36] *something [08:31:36] (03CR) 10Arnaudb: [C:03+1] doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [08:32:09] <_joe_> elukey: I think there's some value for the releng images too, which are ran constantly in CI, but that's not as important [08:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:32:23] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163738|temp accounts: Enable temp account creation on further wikis (T397940)]] (duration: 09m 01s) [08:32:28] T397940: Batch 3 deployment of Temp Accounts Major pilots - https://phabricator.wikimedia.org/T397940 [08:33:01] _joe_ true true, I'd like to understand the use case though, because I have the sense that nobody really pay attention to those :D [08:33:28] (03CR) 10Volans: [C:03+2] JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:33:34] <_joe_> elukey: I agree, but that' s because we don't have a closed loop on maintenance [08:33:50] (03CR) 10Volans: [C:03+2] JS/CSS: update DataTables library [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:34:13] (03CR) 10Volans: [C:03+2] kubernetes: skip 
missing/failing images on update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:34:24] (03Merged) 10jenkins-bot: JS/CSS: fix CSP headers and CDN inclusion [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164859 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:34:40] (03Merged) 10jenkins-bot: JS/CSS: update DataTables library [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164860 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:35:04] (03Merged) 10jenkins-bot: kubernetes: skip missing/failing images on update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164947 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:35:27] _joe_: we could also have separate instances, one for prod and one for CI [08:35:49] (03CR) 10Nikerabbit: [C:03+1] Remove cxstats campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) (owner: 10KartikMistry) [08:37:06] (03PS1) 10Vgutierrez: Revert "cache,haproxy: refactor haproxy captures to fix x-analytics logging" [puppet] - 10https://gerrit.wikimedia.org/r/1164952 [08:37:31] (03PS2) 10Vgutierrez: Revert "cache,haproxy: refactor haproxy captures to fix x-analytics logging" [puppet] - 10https://gerrit.wikimedia.org/r/1164952 (https://phabricator.wikimedia.org/T397917) [08:37:39] (03PS1) 10Volans: Upstream release v0.6.3 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164953 [08:37:40] (03CR) 10David Caro: "LGTM, just a nit that does not need to be fixed in this patch, but would be good to keep in mind" [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah) [08:37:45] (03CR) 10David Caro: [C:03+1] openstack: puppet-enc: Return helpful error for invalid role data [puppet] - 10https://gerrit.wikimedia.org/r/1164945 
(https://phabricator.wikimedia.org/T398117) (owner: 10Majavah) [08:37:57] (03CR) 10Volans: [C:03+2] Upstream release v0.6.3 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164953 (owner: 10Volans) [08:38:22] (03CR) 10Majavah: [C:03+2] openstack: puppet-enc: Return helpful error for invalid role data [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah) [08:38:31] (03CR) 10Majavah: [C:03+2] openstack: puppet-enc: Return helpful error for invalid role data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah) [08:39:05] (03Merged) 10jenkins-bot: Upstream release v0.6.3 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164953 (owner: 10Volans) [08:39:55] !log UTC morning deploys done [08:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:00] (03CR) 10Vgutierrez: [C:03+2] Revert "cache,haproxy: refactor haproxy captures to fix x-analytics logging" [puppet] - 10https://gerrit.wikimedia.org/r/1164952 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [08:42:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:45:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:47:51] !log uploaded debmonitor-server,python3-debmonitor_0.6.3 to apt.wikimedia.org bookworm-wikimedia [08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:54] \o/ [08:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:50:57] !log test routed ganeti compatible bird on ganeti2034/testvm2006 - T362392 [08:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:03] T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392 [08:53:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:22] (03PS1) 10Volans: CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696) [08:59:19] (03CR) 10Elukey: [C:03+1] CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:00:50] (03CR) 10Volans: [C:03+2] CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:01:41] (03Merged) 10jenkins-bot: CSP: fix headers when loading local resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164955 
(https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:02:53] (03PS1) 10Tiziano Fogli: LibericaEtcdErrors: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164957 (https://phabricator.wikimedia.org/T396320) [09:04:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [09:04:29] (03PS1) 10Aklapper: Push Due Date value higher [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164958 [09:04:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10957604 (10ops-monitoring-bot) Draining ganeti5004.eqsin.wmnet of running VMs [09:04:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [09:05:20] (03CR) 10Aklapper: [V:03+2 C:03+2] Push Due Date value higher [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164958 (owner: 10Aklapper) [09:06:05] (03PS1) 10Tiziano Fogli: PyBalBGPUnstable: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164959 (https://phabricator.wikimedia.org/T396321) [09:12:26] (03CR) 10FNegri: [C:03+1] P:exim4::smarthost: Migrate to ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164483 (owner: 10Majavah) [09:12:39] (03CR) 10Majavah: [C:03+2] P:exim4::smarthost: Migrate to ec-prime256v1 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1164483 (owner: 10Majavah) [09:12:57] (03CR) 10FNegri: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1164945 (https://phabricator.wikimedia.org/T398117) (owner: 10Majavah) [09:14:50] !log klausman@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: Shutting down for Ganeti migration [09:18:51] !log jmm@cumin1003 
START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [09:19:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957626 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs [09:19:53] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [09:20:17] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [09:20:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957628 (10ops-monitoring-bot) Draining ganeti2019.codfw.wmnet of running VMs [09:20:50] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [09:23:07] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [09:23:39] !log klausman@cumin2002 START - Cookbook sre.hosts.remove-downtime for ml-staging-ctrl2001.codfw.wmnet [09:23:40] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-staging-ctrl2001.codfw.wmnet [09:26:48] (03CR) 10Hnowlan: [C:03+2] changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [09:26:59] (03PS1) 10Jgiannelos: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 [09:28:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:53] (03Merged) 10jenkins-bot: 
changeprop: fix broken metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164422 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [09:31:08] (03PS2) 10Jgiannelos: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 [09:38:33] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [09:39:06] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [09:40:21] (03PS1) 10Muehlenhoff: docker-report: filter all zuul images in oci filter [puppet] - 10https://gerrit.wikimedia.org/r/1164963 [09:42:16] (03CR) 10Elukey: [C:03+1] docker-report: filter all zuul images in oci filter [puppet] - 10https://gerrit.wikimedia.org/r/1164963 (owner: 10Muehlenhoff) [09:42:19] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:42:58] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:43:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [09:44:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10957739 (10ops-monitoring-bot) Draining ganeti5004.eqsin.wmnet of running VMs [09:45:48] (03CR) 10Muehlenhoff: [C:03+2] docker-report: filter all zuul images in oci filter [puppet] - 10https://gerrit.wikimedia.org/r/1164963 (owner: 10Muehlenhoff) [09:51:10] (03PS1) 10Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) [09:51:13] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [09:51:53] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [09:53:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:59] !log installing nginx security updates [09:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:19] (03PS1) 10Jelto: gitlab: pass ensure flag to auto_restarts::service [puppet] - 10https://gerrit.wikimedia.org/r/1164965 (https://phabricator.wikimedia.org/T396622) [09:56:08] (03CR) 10Majavah: [C:03+2] toolforge: wmcs-package-build: Fix Aptly host name [puppet] - 10https://gerrit.wikimedia.org/r/1153586 (owner: 10Majavah) [09:56:15] (03CR) 10Majavah: [C:03+2] toolforge: wmcs-package-build: Remove unneeded escape [puppet] - 10https://gerrit.wikimedia.org/r/1153587 (https://phabricator.wikimedia.org/T396004) (owner: 10Majavah) [09:56:27] (03CR) 10Majavah: [C:03+2] P:toolforge: aptly: Install rsync for backups [puppet] - 10https://gerrit.wikimedia.org/r/1153588 (owner: 10Majavah) [09:57:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet [09:57:43] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6105/console" [puppet] - 10https://gerrit.wikimedia.org/r/1164965 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto) [10:00:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1000) [10:00:44] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: pass ensure flag to auto_restarts::service [puppet] - 10https://gerrit.wikimedia.org/r/1164965 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto) [10:01:09] PROBLEM - mailman list info on lists1004 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:01:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet [10:01:56] (03PS1) 10Btullis: Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) [10:05:20] !log depool codfw ms-swift for container DB repairs T383053 [10:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:26] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [10:05:34] !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [10:06:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 9.965 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:06:58] !log repair wikipedia-commons-local-thumb.6e on ms-be2059 ms-be2058 ms-be2076 T383053 [10:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:18] log installing openssl security updates [10:09:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:11:09] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.657 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:11:19] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1003.eqiad.wmnet with reason: Maintenance and reboot [10:12:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:14:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:14:22] (03PS1) 10Michael Große: Growth(enwiki): enable limiting Add a Link to new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) [10:17:10] !log repair wikipedia-commons-local-thumb.99 on ms-be2064 T383053 [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:16] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [10:18:03] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 8.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:18:09] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.559 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:18:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:00] (03PS1) 10Majavah: P:wmcs: ntp: Automatically restart the service after config changes [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) [10:20:42] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: 10Majavah) [10:21:14] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be2076.codfw.wmnet with reason: container db repair [10:21:23] 06SRE, 
10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10957886 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=86ad3659-4dfe-4a19-8925-7580975c3341) set by mvernon@cumin2002 fo... [10:22:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:28] !log repair wikipedia-commons-local-thumb.bb on ms-be2076 T383053 [10:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:34] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [10:22:38] (03PS1) 10Ayounsi: Remove some Arelion/NTT traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/1164972 (https://phabricator.wikimedia.org/T377844) [10:23:09] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:23:59] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:01] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 5.572 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.038 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:13] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [10:24:29] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [10:24:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be2076.codfw.wmnet [10:24:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2076.codfw.wmnet [10:25:38] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [10:25:56] !log repool codfw ms-swift after container DB repairs T383053 [10:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:07] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6e [10:26:52] (03Merged) 10jenkins-bot: mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [10:27:26] (03CR) 10Joal: [C:03+1] "LGTM! Should we also bump the available CPU resource for the pod?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis) [10:27:37] (03Merged) 10jenkins-bot: mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [10:28:30] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:28:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:53] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:29:02] (03PS25) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [10:29:49] (03CR) 10Cyndywikime: [C:03+1] Growth(enwiki): enable limiting Add a Link to new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [10:30:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10957935 (10MoritzMuehlenhoff) [10:30:18] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2021.codfw.wmnet [10:33:19] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [10:33:36] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1003.eqiad.wmnet: Renew puppet certificate - jynus@cumin1003 [10:33:38] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: add ImageSiftBot [puppet] - 10https://gerrit.wikimedia.org/r/1164975 [10:33:51] jmm@cumin1003 decommission (PID 3848085) is 
awaiting input [10:34:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [10:35:48] (03CR) 10Giuseppe Lavagetto: [C:03+2] external_clouds_vendors: add ImageSiftBot [puppet] - 10https://gerrit.wikimedia.org/r/1164975 (owner: 10Giuseppe Lavagetto) [10:36:05] PROBLEM - Host backup1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:36:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6e [10:37:15] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [10:37:27] RECOVERY - Host backup1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:37:58] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [10:38:09] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [10:40:30] !log depool cp7001 [10:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957961 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium Hi, As far as I can tell, you have access to the `analytics-platform-eng-adm... 
[10:42:18] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: haproxy tetss [10:43:34] jmm@cumin1003 decommission (PID 3848085) is awaiting input [10:43:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [10:43:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957980 (10Clement_Goubert) [10:45:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957983 (10Clement_Goubert) An old version of the L3 document was signed, could you sign the updated version as well, please? [10:45:09] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10957984 (10Clement_Goubert) a:03Clement_Goubert [10:45:54] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:46:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10957990 (10Clement_Goubert) 05Open→03Stalled Stalled waiting for SSH k... 
[10:46:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10957992 (10Clement_Goubert) p:05Triage→03Medium [10:46:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:46:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:46:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2021.codfw.wmnet [10:47:27] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:47:37] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2022.codfw.wmnet [10:47:42] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [10:48:12] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10958013 (10Clement_Goubert) 05Open→03Stalled a:03Clement_Goubert Stalled waiting for confirmation of access from @DerHexer [10:49:09] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10958019 (10Clement_Goubert) p:05Triage→03Medium [10:49:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10958020 (10Clement_Goubert) [10:50:00] (03CR) 10Hnowlan: [C:04-1] "This will need an image bump. 
Will `0x` be in the $PATH by default?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos) [10:50:43] jmm@cumin1003 decommission (PID 3848941) is awaiting input [10:50:45] (03PS1) 10Cyndywikime: Growth: Configure higher impact module edit limits for english and test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) [10:52:27] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: remove for decom [10:53:28] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10958046 (10Joe) p:05Triage→03High a:03Joe [10:53:34] (03CR) 10Ayounsi: "Awesome, thanks a lot! a few small comments inline." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161448 (owner: 10Effie Mouzeli) [10:54:55] jmm@cumin1003 decommission (PID 3848941) is awaiting input [10:55:28] (03CR) 10Urbanecm: "question: should we enable on testwiki first, or is this safe enough to deploy on enwiki together with testwiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [10:56:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:58:08] (03PS1) 10Muehlenhoff: Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590) [10:58:33] (03CR) 10Michael Große: "TBH, I half made-up the decision to also deploy it to testwiki in the first place, because I don't think that it makes sense to deploy it " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [10:58:34] (03CR) 10CI reject: [V:04-1] Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [10:59:34] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7001.magru.wmnet [10:59:34] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7001.magru.wmnet [10:59:47] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [11:00:35] (03CR) 10Vgutierrez: [C:03+1] service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) (owner: 10Hnowlan) [11:01:43] (03PS1) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) [11:02:51] (03PS26) 10Volans: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 
(owner: 10Elukey) [11:03:52] (03PS69) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [11:05:09] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [11:05:17] !log repool cp7001 [11:05:19] taavi@cumin1003 netbox (PID 3851379) is awaiting input [11:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2022.codfw.wmnet [11:09:12] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission ganeti2021 / ganeti2022 - https://phabricator.wikimedia.org/T398182#10958090 (10MoritzMuehlenhoff) [11:09:18] (03CR) 10Filippo Giunchedi: [C:03+1] PyBalBGPUnstable: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164959 (https://phabricator.wikimedia.org/T396321) (owner: 10Tiziano Fogli) [11:09:23] (03CR) 10Filippo Giunchedi: [C:03+1] LibericaEtcdErrors: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164957 (https://phabricator.wikimedia.org/T396320) (owner: 10Tiziano Fogli) [11:09:27] (03PS2) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) [11:09:56] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloudvps ns-recursor v6 addresses - taavi@cumin1003" [11:10:13] !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloudvps ns-recursor v6 addresses - taavi@cumin1003" [11:10:13] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:10:20] (03CR) 10Jcrespo: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [11:10:34] (03CR) 10Vgutierrez: [C:03+2] haproxy,varnish: Introduce a host independent healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1164449 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [11:10:47] (03PS1) 10Majavah: Add include for WMCS codfw private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1164983 (https://phabricator.wikimedia.org/T379282) [11:11:52] (03CR) 10Majavah: [C:03+2] Add include for WMCS codfw private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1164983 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [11:11:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:11:58] !log taavi@dns1004 START - running authdns-update [11:12:18] (03PS3) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) [11:12:22] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [11:12:37] (03PS2) 10Muehlenhoff: Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590) [11:12:57] !log Start GrowthExperiments:fixLinkRecommendationData --wiki=enwiki --db-table --force (T386867) [11:13:00] !log taavi@dns1004 END - running authdns-update [11:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:03] T386867: Add a Link: add 
"do not link" rule for country names (Q6256) on English Wikipedia - https://phabricator.wikimedia.org/T386867 [11:15:44] (03CR) 10Jcrespo: "@Alex see here promised cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [11:16:01] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2019 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164980 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [11:16:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:18:06] (03PS1) 10Gmodena: Revert "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 [11:18:07] jouncebot: nowandnext [11:18:07] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [11:18:07] In 1 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1300) [11:18:17] (03CR) 10Hnowlan: [C:03+2] service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) (owner: 10Hnowlan) [11:18:21] hey folks. 
We to do an emergency deployment for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164984 [11:19:10] cc ^ taavi [11:20:27] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [11:20:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10958131 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs [11:21:38] hey folks. We to do an emergency deployment for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164984 cc / taavi [11:21:48] *need [11:21:58] !log depool eqiad ms-swift for container DB repairs T383053 [11:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:04] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [11:22:05] head's up vgutierrez slyngs (on-call) ^ [11:22:09] !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [11:22:22] And I think you can go ahead gmodena [11:22:38] (03PS2) 10Btullis: Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) [11:22:38] not sure why you're only pinging me, 301 effie [11:23:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 (owner: 10Gmodena) [11:23:41] (03CR) 10Jcrespo: "@volans could I get a review from you to alter: cumin/aliases.yaml.erb as suggested?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [11:24:08] (03Merged) 10jenkins-bot: Revert "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 (owner: 10Gmodena) [11:24:22] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1164984|Revert "Clean up EventBus and jobs config"]] [11:26:34] !log phuedx@deploy1003 gmodena, phuedx: Backport for [[gerrit:1164984|Revert "Clean up EventBus and jobs config"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:26:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:29:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [11:29:47] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [11:30:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10958161 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs [11:35:53] (03CR) 10Btullis: "Done." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis) [11:35:54] verified on the testserver. Syncing. 
[11:36:01] !log phuedx@deploy1003 gmodena, phuedx: Continuing with sync [11:38:50] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1063,1074,1083].eqiad.wmnet with reason: container db repair [11:38:56] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958196 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9fb1e08-c62e-4b34-b173-ffc58ee22ef8) set by mvernon@cumin2002 fo... [11:39:18] !log repair wikipedia-commons-local-thumb.6b on ms-be10[63,74,83] T383053 [11:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:24] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [11:40:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) (owner: 10KartikMistry) [11:41:41] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164984|Revert "Clean up EventBus and jobs config"]] (duration: 17m 19s) [11:43:03] (03CR) 10Btullis: [C:03+2] Airflow-main: Increase parallelism and related values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis) [11:43:17] There's been no changes in the logs on mwlog1002 [11:43:24] (Good) [11:43:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1063,1074,1083].eqiad.wmnet [11:43:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1063,1074,1083].eqiad.wmnet [11:44:55] (03Merged) 10jenkins-bot: Airflow-main: Increase parallelism and related values 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1164967 (https://phabricator.wikimedia.org/T398164) (owner: 10Btullis) [11:45:27] (03CR) 10Majavah: [C:03+2] dynamicproxy: Support IPv6-enabled recursors [puppet] - 10https://gerrit.wikimedia.org/r/1160135 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [11:45:34] (03CR) 10Majavah: [C:03+2] P:toolforge: nginx: Support IPv6-enabled recursors [puppet] - 10https://gerrit.wikimedia.org/r/1160136 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [11:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:45:49] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1067,1070,1089].eqiad.wmnet with reason: container db repair [11:45:55] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=05b6b261-690d-46bf-a42d-e69d15adcfc8) set by mvernon@cumin2002 fo... [11:45:56] !log repair wikipedia-commons-local-thumb.79 on ms-be10[70,67,89] T383053 [11:45:59] Are ye all done with your deploys? 
I'd like to restart pybal if possible [11:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:03] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [11:47:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [11:47:55] hnowlan: I'm done with my deploy [11:48:19] cool, ty [11:50:00] !log restarting pybal on lvs-secondary-eqiad [11:50:01] (03CR) 10Muehlenhoff: [C:03+2] Add Joanna to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/1161896 (owner: 10Muehlenhoff) [11:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1067,1070,1089].eqiad.wmnet [11:50:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1067,1070,1089].eqiad.wmnet [11:50:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [11:51:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10958244 (10MoritzMuehlenhoff) [11:52:02] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1066,1087,1090].eqiad.wmnet with reason: container db repair [11:52:08] !log repair wikipedia-commons-local-thumb.b7 ms-be10[66,87,90] T383053 [11:52:09] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958245 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0077a8da-dd9e-44d3-b59a-d42061bdb69b) set by mvernon@cumin2002 fo... 
[11:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:13] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [11:52:20] !log installing mongo-c-driver security updates [11:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:11] (03CR) 10Jgiannelos: "Good call, its under node_modules." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos) [11:53:56] (03CR) 10Volans: "Sure the change LGTM. I assumed that the olddirector is excluded as it's being decommissioned." [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [11:55:45] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164986 [11:56:13] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5004.eqsin.wmnet with reason: reimage [11:56:18] !log restarting pybal on A:lvs-low-traffic-eqiad [11:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:24] (03PS3) 10Jgiannelos: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 [11:57:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1066,1087,1090].eqiad.wmnet [11:57:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1066,1087,1090].eqiad.wmnet [11:58:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5004.eqsin.wmnet with OS bookworm [11:58:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm 
[11:59:15] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1078-1079,1085].eqiad.wmnet with reason: container db repair [11:59:20] !log repair wikipedia-commons-local-thumb.d3 on ms-be10[78,79,85] T383053 [11:59:23] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958267 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=469f036d-84d0-4d5b-8246-e8056b4949ca) set by mvernon@cumin2002 fo... [11:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:26] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [11:59:47] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164986 (owner: 10PipelineBot) [12:00:06] (03CR) 10Volans: "It's being already removed in this change, my bad :)" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [12:00:18] (03CR) 10Jgiannelos: "I updated the patch with the correct path and the new image." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos) [12:00:48] (03CR) 10Hnowlan: [C:03+1] mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos) [12:00:56] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:01:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:01:12] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos) [12:03:11] (03Merged) 10jenkins-bot: mobileapps: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164961 (owner: 10Jgiannelos) [12:04:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1078-1079,1085].eqiad.wmnet [12:04:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1078-1079,1085].eqiad.wmnet [12:05:27] !log repair wikipedia-commons-local-thumb.ea on ms-be10[78,80] T383053 [12:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:33] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [12:05:36] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[1078,1080].eqiad.wmnet with reason: container db repair [12:05:41] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958286 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a1ed9509-0228-4592-b2ad-7dd36a1c170f) set by mvernon@cumin2002 fo... 
[12:06:23] (03PS2) 10JMeybohm: k8s.pool-depool-cluster: Black format [cookbooks] - 10https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) [12:06:24] (03PS12) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [12:07:52] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:08:10] (03CR) 10Majavah: [C:03+2] natlog: Persist logs to /srv [puppet] - 10https://gerrit.wikimedia.org/r/1160104 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [12:08:37] (03CR) 10Jcrespo: "Thank, Volans, will wait for Alex or someone else for the rest of the changes to review." [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [12:08:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958288 (10VRiley-WMF) [12:09:55] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [12:10:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[1078,1080].eqiad.wmnet [12:10:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[1078,1080].eqiad.wmnet [12:10:46] (03PS4) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) [12:11:12] !log repool eqiad ms-swift after container DB repairs T383053 [12:11:13] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [12:11:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:21] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [12:11:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958293 (10VRiley-WMF) @BTullis We have been trying to image an-worker1186 for a while now. Working with @Jhancock.wm on this for a while and it... [12:11:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:11:40] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: 10Jcrespo) [12:12:13] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ea [12:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:14:26] (03PS1) 10Brouberol: global_config: provision thanos-swift-{eqiad,codfw} external services [puppet] - 10https://gerrit.wikimedia.org/r/1164991 (https://phabricator.wikimedia.org/T398186) [12:15:16] (03PS1) 10Brouberol: airflow-ml: enable task pods to reach out to thanos-swift in both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164993 (https://phabricator.wikimedia.org/T398186) [12:15:17] (03PS1) 10Brouberol: airflow-ml: define a connection to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164994 (https://phabricator.wikimedia.org/T398186) [12:15:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10958319 (10Jclark-ctr) [12:17:48] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:17:55] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: 10Gergő Tisza) [12:18:44] (03PS1) 10Volans: images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) [12:21:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ea [12:22:07] (03PS2) 10Volans: debmonitor: use the new endpoint for the check [puppet] - 10https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696) [12:22:07] (03PS1) 10Volans: debmonitor: add link to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) [12:23:03] (03CR) 10Btullis: [C:03+1] global_config: provision thanos-swift-{eqiad,codfw} external services [puppet] - 10https://gerrit.wikimedia.org/r/1164991 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol) [12:23:24] (03CR) 10Btullis: [C:03+1] airflow-ml: enable task pods to reach out to thanos-swift in both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164993 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol) [12:24:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [12:24:11] (03CR) 10Btullis: [C:03+1] airflow-ml: define a connection to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164994 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol) [12:24:40] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: provision thanos-swift-{eqiad,codfw} external services [puppet] - 10https://gerrit.wikimedia.org/r/1164991 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol) 
[12:24:51] (03CR) 10Urbanecm: "Doing both in the same window wouldn't provide additional opportunities for testing. If we are confident about the feature, deploying as-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [12:24:57] (03CR) 10Elukey: [C:03+1] images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:25:09] (03CR) 10Elukey: [C:03+1] debmonitor: add link to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:25:18] (03CR) 10Brouberol: [C:03+2] airflow-ml: enable task pods to reach out to thanos-swift in both DCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164993 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol) [12:26:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:26:55] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:27:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [12:28:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:28:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:29:53] (03CR) 10Volans: [C:03+2] images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:30:43] (03Merged) 10jenkins-bot: images: improve extenal images support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164996 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [12:31:43] (03CR) 10Clément Goubert: 
sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [12:32:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:32:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:34:00] (03PS1) 10Muehlenhoff: Use separate resource names for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165003 [12:34:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [12:35:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [12:35:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165003 (owner: 10Muehlenhoff) [12:36:27] !log installing qtbase-opensource-src security updates [12:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [12:38:49] FIRING: HelmReleaseBadStatus: Helm release airflow-main/production on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:38:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [12:39:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:40:27] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165003 (owner: 10Muehlenhoff) [12:40:32] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [12:40:41] (03PS3) 10JMeybohm: k8s.pool-depool-cluster: Black format [cookbooks] - 10https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) [12:40:41] (03PS13) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [12:43:45] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:43:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-main/production on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:45:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:46:33] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:47:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:49:10] !log jgreen@cumin1002 START - Cookbook sre.dns.netbox [12:52:57] (03CR) 10Brouberol: [C:03+2] airflow-ml: define a connection to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164994 (https://phabricator.wikimedia.org/T398186) (owner: 10Brouberol) [12:54:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5004.eqsin.wmnet with OS bookworm [12:54:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958470 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm completed: - ganeti5004 (**PASS*... 
[12:54:43] jgreen@cumin1002 netbox (PID 1860130) is awaiting input [12:55:20] (03CR) 10Muehlenhoff: [C:03+2] Use separate resource names for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165003 (owner: 10Muehlenhoff) [12:55:41] !log jgreen@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frnetmon1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1002" [12:55:45] !log jgreen@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frnetmon1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1002" [12:55:45] !log jgreen@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:56:18] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frnetmon1001 - https://phabricator.wikimedia.org/T398079#10958475 (10Jgreen) a:05Jgreen→03None [12:56:49] PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 112.71, 101.97, 93.48 https://wikitech.wikimedia.org/wiki/Swift [12:58:00] (03PS1) 10Volans: Upstream release v0.6.4 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165006 [12:58:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:58:49] PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 107.95, 101.43, 96.22 https://wikitech.wikimedia.org/wiki/Swift [12:58:52] !log jmm@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004 [13:00:05] Urbanecm and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1300). 
[13:00:05] phuedx, LD, and sd0001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] !log jmm@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti5004 [13:00:31] party time \O/ [13:00:37] o/ [13:01:06] o/ [13:01:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:02:00] I can deploy mine. LD, sd0001: Can you deploy yours? [13:02:16] yep [13:02:23] i mean no [13:02:25] I don't have access, can you deploy mine too? [13:02:51] (03PS1) 10Jgiannelos: mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 [13:02:59] (03CR) 10CI reject: [V:04-1] mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 (owner: 10Jgiannelos) [13:03:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:03:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:03:56] (03PS2) 10Jgiannelos: mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 [13:04:07] By the way, neither of our changes has proper tests, so it's an all-in [13:04:28] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6b [13:04:49] PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 105.24, 100.36, 95.66 https://wikitech.wikimedia.org/wiki/Swift [13:05:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by
phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164388 (https://phabricator.wikimedia.org/T397611) (owner: 10Phuedx) [13:05:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:05:57] LD: Do you have a way to test your change when it's on the test servers? [13:06:25] (03Merged) 10jenkins-bot: ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164388 (https://phabricator.wikimedia.org/T397611) (owner: 10Phuedx) [13:06:38] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1164388|ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams (T397611)]] [13:06:43] T397611: Decommission the TwoColConflictConflict and -Exit instruments - https://phabricator.wikimedia.org/T397611 [13:07:04] phuedx i don't think so [13:07:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:07:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:07:49] PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 111.00, 102.57, 98.82 https://wikitech.wikimedia.org/wiki/Swift [13:08:04] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10958533 (10MatthewVernon) These corrupt DBs have all been repaired now. [13:08:31] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1164388|ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams (T397611)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [13:08:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165006 (owner: 10Volans) [13:09:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:09:36] Tested by hitting the streamconfigs API. The TwoColConflict* stream configs are not present in the response on the test server [13:09:43] !log phuedx@deploy1003 phuedx: Continuing with sync [13:10:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:10:11] (03PS27) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [13:12:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:12:40] (03PS1) 10Muehlenhoff: Add library hint for qtbase-opensource-src [puppet] - 10https://gerrit.wikimedia.org/r/1165012 [13:12:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:13:18] (03PS1) 10Eevans: sessionstore1004: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165013 (https://phabricator.wikimedia.org/T391544) [13:13:19] (03PS1) 10Eevans: sessionstore1004: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165014 (https://phabricator.wikimedia.org/T391544) [13:13:21] (03PS1) 10Eevans: sessionstore1005: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165015 (https://phabricator.wikimedia.org/T391544) [13:13:22] (03PS1) 10Eevans: sessionstore1005: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165016 (https://phabricator.wikimedia.org/T391544) [13:13:24] (03PS1) 10Eevans: sessionstore1006: reimage for JBOD configuration [puppet] - 
10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544) [13:13:26] (03PS1) 10Eevans: sessionstore1006: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165018 (https://phabricator.wikimedia.org/T391544) [13:13:28] (03PS1) 10Eevans: sessionstore: preseed eqiad servers for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1165019 (https://phabricator.wikimedia.org/T391544) [13:14:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6b [13:14:07] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.79 [13:14:31] sd0001: I see the note about query performance against your patch. Has the issue been taken care of? [13:15:11] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164388|ext-EventStreamConfig: Remove eventlogging_TwoColConflict* streams (T397611)]] (duration: 08m 32s) [13:15:16] T397611: Decommission the TwoColConflictConflict and -Exit instruments - https://phabricator.wikimedia.org/T397611 [13:15:30] phuedx: yes, the query was tested on mwdebug by musikanimal and it took only 1 minute [13:16:05] (it's run in a cron job, not in webrequests, so 1 min is fine) [13:16:08] (03CR) 10JHathaway: [C:03+1] Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi) [13:16:45] jclark@cumin1002 provision (PID 1884572) is awaiting input [13:16:49] PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 104.40, 101.15, 98.76 https://wikitech.wikimedia.org/wiki/Swift [13:17:10] (03CR) 10Jgiannelos: [C:04-1] "Even with this 0x is spamming with debug logs. 
Looking at it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 (owner: 10Jgiannelos) [13:17:35] (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [13:17:49] PROBLEM - very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 100.88, 100.01, 99.29 https://wikitech.wikimedia.org/wiki/Swift [13:17:58] (03PS1) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) [13:18:24] (03CR) 10CI reject: [V:04-1] pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: 10Tiziano Fogli) [13:18:32] thanos-be cluster struggling is expected? [13:18:48] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for qtbase-opensource-src [puppet] - 10https://gerrit.wikimedia.org/r/1165012 (owner: 10Muehlenhoff) [13:19:35] (03PS2) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) [13:19:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD) [13:20:26] (03Merged) 10jenkins-bot: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) (owner: 10LD) [13:20:40] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1161478|frwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T397063)]] [13:20:46] T397063: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group - https://phabricator.wikimedia.org/T397063 [13:20:49] PROBLEM - 
very high load average likely xfs on thanos-be1009 is CRITICAL: CRITICAL - load average: 105.14, 100.81, 99.62 https://wikitech.wikimedia.org/wiki/Swift [13:21:23] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165013 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [13:21:28] (03PS1) 10Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165021 (https://phabricator.wikimedia.org/T398164) [13:21:49] PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 105.78, 100.90, 98.99 https://wikitech.wikimedia.org/wiki/Swift [13:21:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:22:34] !log phuedx@deploy1003 phuedx, wpld: Backport for [[gerrit:1161478|frwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T397063)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:23:20] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.79 [13:23:21] LD: I looked at your change. It's been OK'd by Tchanders and Dreamy_Jazz. 
Unless you know a bureaucrat on frwiki, it'll be quite hard to test on the test servers :) [13:23:22] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.99 [13:23:41] (03PS3) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) [13:24:10] phuedx I'm not ^^ [13:25:02] (03CR) 10Btullis: [C:03+1] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165021 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol) [13:25:04] but it should work fine, coreperm stuff anyway [13:25:33] !log phuedx@deploy1003 phuedx, wpld: Continuing with sync [13:26:34] jclark@cumin1002 reimage (PID 1894712) is awaiting input [13:27:19] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004 [13:27:47] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti5004 [13:28:35] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10958586 (10MoritzMuehlenhoff) [13:29:01] sd0001: I've had a quick look at the patch that's referenced in the commit. 
The query is run on every request to /wiki/Special:GadgetUsage and not via a cron job [13:29:25] no, it's cached as part of the QueryPage system if MiserMode is off [13:29:36] * I mean if MiserMode is on [13:29:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [13:29:56] !log jmm@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004 [13:29:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:30:19] !log jmm@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti5004 [13:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:31:16] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161478|frwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T397063)]] (duration: 10m 36s) [13:31:22] T397063: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group - https://phabricator.wikimedia.org/T397063 [13:32:04] (03CR) 10Ssingh: [C:03+1] "Looks good note that even for Cloud, we made a few changes to the ntpsec.conf file, such as removing the iburst option. 
See Id62f3bf2a4d11" [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: 10Majavah) [13:32:17] (03CR) 10Brouberol: [C:03+2] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165021 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol) [13:33:00] phuedx: any concerns? see also this comment by Ladsgroup about how the query is run on Wikimedia infra: https://phabricator.wikimedia.org/T121516#10916810 (unrelated ticket, but same topic) [13:34:28] (03PS1) 10Slyngshede: Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) [13:35:27] sd0001: Thanks for that. Reading :) [13:35:50] (03PS1) 10JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) [13:35:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.99 [13:35:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b7 [13:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:36:25] (03CR) 10Slyngshede: "The issue can be reproduced by the following code:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [13:37:54] Alright, thanks for the party, phuedx [13:38:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ganeti5004.eqsin.wmnet [13:38:26] (03CR) 10Ssingh: hiera: enable exporting prom metrics from doh1001 for anycast-hc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:40:02] jouncebot: nowandnext [13:40:02] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1300) [13:40:03] In 0 hour(s) and 49 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [13:40:34] sd0001: Just to confirm: The query has been optimized. The optimized query has been deployed and already been run (it was merged ~10 days ago) [13:40:37] (03PS2) 10Slyngshede: Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) [13:40:57] The config change is to use the cached results of that optimized query on enwiki? 
[13:41:00] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:41:51] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [13:41:58] phuedx: config change is to trigger the optimized query to run [13:42:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [13:42:46] (it will run the next time maintenance/updateSpecialPages.php is run) [13:43:10] until that time, Special:GadgetUsage will show some dummy values for active users as data won't be available [13:43:28] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [13:43:38] I see [13:44:05] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:44:10] (03PS1) 10Jgiannelos: mobileapps: Fix staging config path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165028 [13:44:12] (03Abandoned) 10Jgiannelos: mobileapps: Use default logging level on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165009 (owner: 10Jgiannelos) [13:44:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:44:49] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b7 [13:44:52] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.bb [13:44:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958648 (10MoritzMuehlenhoff) [13:45:24] (03CR) 10TrainBranchBot: 
[C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164490 (https://phabricator.wikimedia.org/T397454) (owner: 10SD0001) [13:46:33] sd0001: It goes without saying: Please don't land the removal of the config flag in the Gadgets extension until you and Data Persistence are satisfied that the query is performant enough :) [13:46:41] (03Merged) 10jenkins-bot: Re-enable wgSpecialGadgetUsageActiveUsers for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164490 (https://phabricator.wikimedia.org/T397454) (owner: 10SD0001) [13:46:54] phuedx: sure [13:46:57] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1164490|Re-enable wgSpecialGadgetUsageActiveUsers for enwiki (T397454)]] [13:47:04] T397454: Show active user stats on Special:GadgetUsage in English Wikipedia - https://phabricator.wikimedia.org/T397454 [13:47:20] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003" [13:47:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003" [13:47:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:48:35] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox) [13:48:56] !log phuedx@deploy1003 phuedx, sd: Backport for [[gerrit:1164490|Re-enable wgSpecialGadgetUsageActiveUsers for enwiki (T397454)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:49:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:49:19] sd0001: Please check Special:GadgetUsage on the test servers :) [13:49:42] phuedx: looks good, can see the column for active users showing up [13:49:59] !log phuedx@deploy1003 phuedx, sd: Continuing with sync [13:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:51:05] (03CR) 10Ayounsi: [C:03+1] Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [13:51:19] (03CR) 10Ssingh: [V:03+1] "Thanks for the reminder; I will once I merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:55:26] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164490|Re-enable wgSpecialGadgetUsageActiveUsers for enwiki (T397454)]] (duration: 08m 28s) [13:55:32] T397454: Show active user stats on Special:GadgetUsage in English Wikipedia - https://phabricator.wikimedia.org/T397454 [13:55:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [13:55:42] !log UTC afternoon backport window finished [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.bb [13:56:39] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d3 [13:56:53] phuedx: thanks! [13:57:28] sd0001: yw. Thanks for pointing me at the QueryPage subsystem. TIL! [13:57:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:58:01] (03PS14) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [13:58:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [13:58:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958700 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... 
[13:58:56] !log push pfw policies - T397875 [13:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:00:31] jouncebot: nowandnext [14:00:32] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [14:00:32] In 0 hour(s) and 29 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [14:00:58] phuedx: if you're comfortable, can you also run the script so that the actual counts of active users shows up? [14:01:06] see https://wikitech.wikimedia.org/wiki/Regenerate_cached_special_pages - although the instructions are old, command would need to be adjusted to use mwscript-k8s [14:01:26] (03PS2) 10JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) [14:02:12] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox [14:03:05] jmm@cumin2002 addnode (PID 3422069) is awaiting input [14:04:30] (03PS3) 10JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) [14:04:52] !log rolling restart of pybal on lvs201[34] [14:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:58] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d3 [14:05:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:05:48] PROBLEM - very high load average likely xfs on thanos-be1007 is CRITICAL: CRITICAL - load average: 100.76, 100.05, 98.18 https://wikitech.wikimedia.org/wiki/Swift [14:06:16] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcephosd200[567]-dev service implementation - https://phabricator.wikimedia.org/T397237#10958743 (10Andrew) p:05High→03Medium [14:06:30] sd0001: I have to run an errand before my next meeting. Could you ask in #wikimedia-data-persistence? [14:06:34] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:07:12] phuedx: okay! [14:07:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1 [14:08:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [14:08:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti5005.eqsin.wmnet [14:08:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [14:08:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958758 (10ops-monitoring-bot) Draining ganeti5005.eqsin.wmnet of running VMs [14:08:57] !decommissioning Cassandra/sessionstore1004-a — T391544 [14:08:57] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [14:09:28] (03CR) 10Eevans: [C:03+2] sessionstore1004: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165013 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [14:09:32] urandom: !log? 
[14:09:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [14:09:58] (03CR) 10Michael Große: Growth: Configure higher impact module edit limits for english and test wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [14:11:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:11:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [14:12:20] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [14:13:07] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-future Generating DNS records from Netbox and syncing changes - demo run of new cookbook - cmooney@cumin1003 [14:13:14] (03CR) 10Vgutierrez: [C:03+2] service: Target upload.wm.o on upload-https healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1164466 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:13:32] FIRING: [2x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:46] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:14:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox-future (exit_code=0) Generating DNS records from Netbox and syncing changes - demo run of new cookbook - cmooney@cumin1003 [14:14:25] (03PS1) 10Jelto: gitlab: remove git_data_dirs setting [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) [14:15:21] (03PS1) 10Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) [14:16:14] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6108/co" [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) (owner: 10Jelto) [14:17:14] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [14:18:31] (03PS4) 10Andrew Bogott: Openstack designate: use 'designate' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) [14:18:33] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [14:18:46] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4010.ulsfo.wmnet} and A:liberica (T394484) [14:18:53] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [14:19:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4010.ulsfo.wmnet} and A:liberica (T394484) [14:19:30] FIRING: LibericaStaleConfig: Liberica instance lvs4010 is 
running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [14:19:55] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2003.codfw.wmnet with reason: Maintenance and reboot [14:20:28] (03CR) 10Btullis: [C:03+1] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol) [14:21:10] (03PS7) 10Tiziano Fogli: pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) [14:23:04] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1186.eqiad.wmnet with reason: host reimage [14:23:47] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1004.eqiad.wmnet with OS bullseye [14:23:51] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:23:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [14:23:59] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10958835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1004.e... 
[14:24:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10958836 (10ops-monitoring-bot) Draining ganeti5005.eqsin.wmnet of running VMs [14:24:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs4010 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [14:25:34] (03CR) 10Brouberol: [C:03+2] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol) [14:26:42] (03PS1) 10Vgutierrez: cache::haproxy: Fix acl checks for unique path healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) [14:26:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1186.eqiad.wmnet with reason: host reimage [14:27:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:27:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:23] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003" [14:27:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test execution time - cmooney@cumin1003" [14:27:28] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:28:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:46] (03CR) 10Ssingh: [C:03+1] "Yes I made it in the review as well. Adds up." [puppet] - 10https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:29:55] (03CR) 10Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol) [14:30:06] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [14:30:08] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix acl checks for unique path healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1165035 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:36:09] (03PS2) 10Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) [14:36:49] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T398024#10958898 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm errors ceased on the 27th. likely issues with xcon owned... [14:36:51] (03PS3) 10Brouberol: airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) [14:36:51] (03CR) 10Eevans: [C:03+2] sessionstore1004: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165014 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [14:39:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [14:40:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:38] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1004.eqiad.wmnet with reason: host reimage [14:42:21] (03CR) 10Herron: [C:03+1] pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: 10Tiziano Fogli) [14:43:38] (03CR) 10Brouberol: [C:03+2] airflow-main: increase max pod/container CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165034 (https://phabricator.wikimedia.org/T398164) (owner: 10Brouberol) [14:44:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability 
low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [14:44:10] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:44:24] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1004.eqiad.wmnet with reason: host reimage [14:44:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:44:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1186.eqiad.wmnet with OS bullseye [14:44:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [14:45:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:45:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:46:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:46:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:46:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10958938 (10Jclark-ctr) It looks like the system had a bad DAC cable. While running the provisioning script, it prompted me to select the PXE por... 
[14:47:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [14:47:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and (A:dnsbox) [14:47:05] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10958939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm [14:47:42] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:49:01] !log sukhe@dns1004 START - running authdns-update [14:49:14] !log running dummy authdns-update after service restarts on A:dnsbox [14:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:48] RECOVERY - very high load average likely xfs on thanos-be1007 is OK: OK - load average: 57.93, 68.68, 79.40 https://wikitech.wikimedia.org/wiki/Swift [14:49:58] !log sukhe@dns1004 END - running authdns-update [14:50:53] (03CR) 10Ssingh: "Very minor comments in-line, nothing related to functionality:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [14:50:59] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:51:48] RECOVERY - very high load average likely xfs on thanos-be1009 is OK: OK - load average: 54.45, 68.38, 79.76 https://wikitech.wikimedia.org/wiki/Swift 
[14:52:10] (03CR) 10Filippo Giunchedi: [C:03+1] pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: 10Tiziano Fogli) [14:54:14] jhancock@cumin1003 provision (PID 3874714) is awaiting input [14:54:47] (03CR) 10Tiziano Fogli: [C:03+2] pushgateway: rotate logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1165020 (https://phabricator.wikimedia.org/T398091) (owner: 10Tiziano Fogli) [14:55:06] (03PS2) 10Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) [14:55:57] (03PS3) 10Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) [14:56:18] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:57:59] (03CR) 10Scott French: [C:03+1] changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [14:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:58:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10958984 (10MoritzMuehlenhoff) [14:58:54] (03PS4) 10Hnowlan: changeprop: add missing total_delay histogram [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) [15:03:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:03:26] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1004.eqiad.wmnet with OS bullseye [15:03:37] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1004.eqiad... [15:04:37] (03CR) 10Hnowlan: [C:03+2] changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [15:05:42] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore1004.eqiad.wmnet [15:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:06:38] (03Merged) 10jenkins-bot: changeprop: add missing total_delay histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164964 (https://phabricator.wikimedia.org/T397970) (owner: 10Hnowlan) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:08] phuedx: there's been a fairly significant increase in worker saturation in mw-web since your patch was rolled out, do you know whether it might be the cause? https://grafana.wikimedia.org/goto/GW0OhZsHg?orgId=1 [15:10:07] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: remove for decom [15:10:38] hnowlan: I deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164388 at around that time. I don't see how removing a config for an inactive event stream would increase worker saturation [15:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:11:16] (03PS1) 10Muehlenhoff: Remove ganeti2020 from Ganeti/codfw [puppet] - 10https://gerrit.wikimedia.org/r/1165042 (https://phabricator.wikimedia.org/T396590) [15:11:22] phuedx: yeah, seems relatively unlikely [15:12:03] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1004.eqiad.wmnet [15:12:06] (03CR) 10Btullis: [C:03+2] Dumps_v1: Disable the sync job that publishes from dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1164150 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis) [15:12:15] (03CR) 10Btullis: [C:03+2] Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system [puppet] - 10https://gerrit.wikimedia.org/r/1164157 (https://phabricator.wikimedia.org/T397848) (owner: 
10Btullis) [15:12:15] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2003.codfw.wmnet: Renew puppet certificate - jynus@cumin1003 [15:12:36] hnowlan: Same for the config change I deployed immediately after it – adding the ability for bureaucrats on frwiki to add/remove a group via the wgAdd- and wgRemoveGroup variables [15:12:38] Seems unlikely [15:12:42] But the timing is suspect [15:13:41] phuedx: seems there might be an external factor [15:15:27] hnowlan: Kinda agree. Worker saturation also seemed to increase during the backport window this morning (UTC) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:52] !bootstrapping Cassandra/sessionstore1004-a — T391544 [15:16:52] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [15:19:34] FIRING: [2x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:50] (03CR) 10Tiziano Fogli: [C:03+2] LibericaEtcdErrors: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164957 (https://phabricator.wikimedia.org/T396320) (owner: 10Tiziano Fogli) [15:23:54] (03CR) 10Andrew Bogott: [C:03+2] Openstack designate: use 'designate' service user instead of novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1164491 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [15:24:05] (03CR) 10Tiziano Fogli: [C:03+2] PyBalBGPUnstable: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1164959 (https://phabricator.wikimedia.org/T396321) (owner: 
10Tiziano Fogli) [15:24:31] (03PS2) 10Eevans: sessionstore1005: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165015 (https://phabricator.wikimedia.org/T391544) [15:24:31] (03PS2) 10Eevans: sessionstore1005: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165016 (https://phabricator.wikimedia.org/T391544) [15:24:31] (03PS2) 10Eevans: sessionstore1006: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544) [15:24:31] (03PS2) 10Eevans: sessionstore1006: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165018 (https://phabricator.wikimedia.org/T391544) [15:24:32] (03PS2) 10Eevans: sessionstore: preseed eqiad servers for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1165019 (https://phabricator.wikimedia.org/T391544) [15:24:39] 07sre-alert-triage, 06SRE Observability, 06Traffic, 13Patch-For-Review: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T396321#10959156 (10tappof) 05Open→03Resolved a:03tappof [15:24:49] 07sre-alert-triage, 06SRE Observability, 06Traffic, 13Patch-For-Review: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T396320#10959159 (10tappof) 05Open→03Resolved a:03tappof [15:27:27] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10959192 (10Joe) [15:28:02] (03CR) 10David Caro: [C:03+1] "LGTM, I'd prefer using epp templates (so it does type check, empty etc.) but I understand it's a bit more tedious." 
[puppet] - 10https://gerrit.wikimedia.org/r/1160104 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [15:29:16] (03CR) 10Eevans: [C:03+2] sessionstore1005: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165015 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:29:45] (03PS2) 10Majavah: natlog: Use a separate journald namespace with no storage [puppet] - 10https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [15:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1530). [15:30:18] 06SRE, 06Data-Engineering: Include accept-language header in turnilo/superset - https://phabricator.wikimedia.org/T398213 (10Joe) 03NEW [15:37:14] (03PS1) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) [15:37:15] (03PS1) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) [15:37:36] (03PS2) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) [15:37:56] (03PS2) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) [15:37:59] (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo) [15:38:07] (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 
10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo) [15:38:16] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Follow up on lists.wm.o TLS usage - https://phabricator.wikimedia.org/T398018#10959306 (10LSobanski) @Vgutierrez who would be doing the work listed in the bullet points, you or us? [15:38:45] (03PS3) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) [15:39:38] (03PS4) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) [15:40:08] (03PS1) 10Majavah: Remove root keys for former staff [labs/private] - 10https://gerrit.wikimedia.org/r/1165052 [15:40:12] (03PS5) 10Jcrespo: dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) [15:40:54] (03PS3) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) [15:41:29] (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo) [15:41:37] (03PS4) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) [15:42:12] (03CR) 10CI reject: [V:04-1] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo) [15:42:45] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10959333 (10cmooney) [15:43:06] (03PS5) 10Jcrespo: dbbackups: Remove backup2002 from production [puppet] - 
10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) [15:44:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:46] 06SRE, 06Infrastructure-Foundations, 10netbox: Netbox script for adding secondary IPs - https://phabricator.wikimedia.org/T378730#10959364 (10cmooney) a:03cmooney @Eevans sorry this one escaped me somehow let me take a look, agreed it seems there is something wrong here. [15:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:47:31] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts backup1002.eqiad.wmnet [15:50:07] jhancock@cumin1003 provision (PID 3881940) is awaiting input [15:52:35] 06SRE, 06cloud-services-team, 10DNS, 06Infrastructure-Foundations, and 2 others: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331#10959423 (10ayounsi) 05Open→03Declined Closing for now, please reopen if nee... 
[15:53:28] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:32] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:53:47] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [15:56:58] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [15:57:24] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [15:57:24] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:57:25] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup1002.eqiad.wmnet [15:59:14] (03CR) 10Jcrespo: [C:03+2] dbbackups: Remove backup1002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165050 (https://phabricator.wikimedia.org/T398210) (owner: 10Jcrespo) [15:59:30] FIRING: [2x] LibericaStaleConfig: Liberica instance lvs6002 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [16:00:01] urandom:sessionstore1005: reimage for JBOD configuration (c95b94c7ee) [16:00:18] should I send it or wait? 
[16:01:58] ^ liberica|pybal alerts are expected [16:02:10] I'm 99% sure this was something you were working in, urandom , but want to make sure the merging was intended [16:04:30] FIRING: [5x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [16:06:56] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6002.drmrs.wmnet,lvs5005.eqsin.wmnet,lvs3009.esams.wmnet,lvs7002.magru.wmnet,lvs4009.ulsfo.wmnet} and A:liberica (T394484) [16:06:58] jynus: oh I'm sorry [16:07:02] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [16:07:03] yes, it can be merged [16:07:10] ...thought I had [16:07:33] !log manually update GadgetUsage on enwiki T397454 [16:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:39] T397454: Show active user stats on Special:GadgetUsage in English Wikipedia - https://phabricator.wikimedia.org/T397454 [16:07:48] urandom: done [16:07:57] jynus: thanks! [16:08:14] jhancock@cumin1003 reimage (PID 3873963) is awaiting input [16:08:29] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:08:29] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6002.drmrs.wmnet,lvs5005.eqsin.wmnet,lvs3009.esams.wmnet,lvs7002.magru.wmnet,lvs4009.ulsfo.wmnet} and A:liberica (T394484) [16:09:29] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is CRITICAL: CRITICAL: Service pybal.service is not active. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:09:30] FIRING: [6x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [16:09:55] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission backup1002 and its disk array - https://phabricator.wikimedia.org/T398210#10959586 (10jcrespo) [16:12:14] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission backup1002 and its disk array - https://phabricator.wikimedia.org/T398210#10959604 (10jcrespo) @Jclark-ctr @VRiley-WMF It is my understanding that these arrays don't have a network interface to disable/DNS to handle, but ofc they will have to be ha... [16:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:12:20] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10959605 (10DerHexer) I must have overlooked the question for confirmation, I'm sorry. I had tested it immediately and it's working well for me, thank you! 
[16:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:13:57] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts backup2002.codfw.wmnet [16:14:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6003.drmrs.wmnet,lvs5006.eqsin.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet,lvs4010.ulsfo.wmnet} and A:liberica (T394484) [16:14:16] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [16:14:22] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10959621 (10Clement_Goubert) 05Stalled→03Resolved Thanks for confirming! [16:14:39] !log decommissioning Cassandra/sessionstore1005-a — T391544 [16:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:45] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [16:14:53] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [16:14:57] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10959628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest2005 (... 
[16:15:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6003.drmrs.wmnet,lvs5006.eqsin.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet,lvs4010.ulsfo.wmnet} and A:liberica (T394484) [16:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:18:29] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:19:29] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:19:30] RESOLVED: [9x] LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [16:19:31] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [16:19:34] FIRING: ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1005-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:47] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye [16:19:57] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959648 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e... 
[16:23:12] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [16:23:32] FIRING: [3x] ProbeDown: Service sessionstore1004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:32] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [16:23:33] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:33] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup2002.codfw.wmnet [16:24:38] (03PS1) 10Clare Ming: Enable experiment configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) [16:24:44] (03CR) 10Jcrespo: [C:03+2] dbbackups: Remove backup2002 from production [puppet] - 10https://gerrit.wikimedia.org/r/1165051 (https://phabricator.wikimedia.org/T398212) (owner: 10Jcrespo) [16:26:46] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10959688 (10DLynch) a:03DLynch [16:26:50] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2002 and its disk array - https://phabricator.wikimedia.org/T398212#10959689 (10jcrespo) [16:26:55] (03PS2) 10Clare Ming: Enable experiment configs fetching for group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) [16:28:09] !log joal@deploy1003 Started deploy 
[airflow-dags/analytics_test@3c90af1]: Synchronize artifacat for airflow_dags/analytics_test [16:28:22] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2002 and its disk array - https://phabricator.wikimedia.org/T398212#10959703 (10jcrespo) @Jhancock.wm It is my understanding that these arrays don't have a network interface to disable/DNS to handle, but ofc they will have to be handled physi... [16:28:24] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@3c90af1]: Synchronize artifacat for airflow_dags/analytics_test (duration: 00m 15s) [16:28:54] !log joal@deploy1003 Started deploy [airflow-dags/analytics@3c90af1]: Synchronize artifacat for airflow_dags/analytics_test [16:29:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: 10Clare Ming) [16:29:33] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@3c90af1]: Synchronize artifacat for airflow_dags/analytics_test (duration: 00m 38s) [16:31:28] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: civi2001 - https://phabricator.wikimedia.org/T397380#10959726 (10Jgreen) a:05Dwisehaupt→03None [16:31:46] (03CR) 10Phuedx: [C:03+1] Enable experiment configs fetching for group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: 10Clare Ming) [16:32:09] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1005.eqiad.wmnet with OS bullseye [16:32:19] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959734 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad... [16:32:43] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye [16:32:52] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e... [16:39:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [16:39:17] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10959771 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm [16:40:27] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10959779 (10Jhancock.wm) [16:45:02] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1005.eqiad.wmnet with OS bullseye [16:45:16] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad... 
[16:45:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [16:45:30] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye [16:45:47] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10959794 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e... [16:49:34] FIRING: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:46] (03PS15) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [16:50:54] (03CR) 10JHathaway: "@rcoccioli@wikimedia.org I think this is ready for a second pass, when you have the time, thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [16:52:14] jhancock@cumin1003 provision (PID 3881940) is awaiting input [16:53:32] RESOLVED: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:26] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:54:31] (03PS3) 10Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) [16:55:26] 10ops-eqiad, 06DC-Ops: sessionstore1005 doesn't boot - 
https://phabricator.wikimedia.org/T398225 (10Eevans) 03NEW [16:55:33] 10ops-eqiad, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10959846 (10Eevans) p:05Triage→03Unbreak! [16:56:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) (owner: 10Bernard Wang) [16:56:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: 10Bernard Wang) [16:56:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: 10Bernard Wang) [16:57:35] !log tchin@deploy1003 Started deploy [airflow-dags/analytics@74e8d66]: Deploying artifacts for T388439 [16:57:42] T388439: Add metrics for monthly reconciles - https://phabricator.wikimedia.org/T388439 [16:58:11] !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@74e8d66]: Deploying artifacts for T388439 (duration: 00m 52s) [17:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [17:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1700). [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1700). [17:00:15] o/ [17:00:41] the work I'd originally planned for this window will be deferred to a later date TBD [17:02:37] jhancock@cumin1003 reimage (PID 3888301) is awaiting input [17:03:25] (03CR) 10Urbanecm: [C:04-1] "should be good to go once the privileged groups denotation is fixed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [17:09:56] (03CR) 10FNegri: [C:03+1] Remove root keys for former staff [labs/private] - 10https://gerrit.wikimedia.org/r/1165052 (owner: 10Majavah) [17:15:04] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Replace analytics fake headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147912 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:21:57] (03CR) 10Majavah: [V:03+2 C:03+2] Remove root keys for former staff [labs/private] - 10https://gerrit.wikimedia.org/r/1165052 (owner: 10Majavah) [17:23:50] (03CR) 10Michael Große: [C:04-1] "This should have had a -1, because a change is needed (see previous comment). But for the max-edit-limit, this is indeed what we want." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [17:23:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:25:14] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165065 [17:28:00] jhancock@cumin1003 provision (PID 3893852) is awaiting input [17:33:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [17:34:28] (03Merged) 10jenkins-bot: Growth(enwiki): enable limiting Add a Link to new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164969 (https://phabricator.wikimedia.org/T386034) (owner: 10Michael Große) [17:34:44] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1164969|Growth(enwiki): enable limiting Add a Link to new editors (T386034)]] [17:34:50] T386034: Add a Link: Community Configuration setting to allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T386034 [17:35:35] (03Abandoned) 10Jgiannelos: mobileapps: Fix staging config path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165028 (owner: 10Jgiannelos) [17:36:34] (03PS4) 10Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) [17:36:39] !log urbanecm@deploy1003 migr, urbanecm: Backport for [[gerrit:1164969|Growth(enwiki): enable limiting Add a Link to new editors (T386034)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:38:31] urbanecm: so far, I'm not seeing any obvious problems with a user having NO add-a-link and only the legacy template-based links task [17:38:40] MichaelG_WMF: me neither [17:39:53] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10960057 (10phaultfinder) [17:40:16] (03PS1) 10Jgiannelos: mobileapps: Use profiler script to spawn profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 [17:40:39] 10SRE-SLO, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q4): liftwing SLO performance issues - https://phabricator.wikimedia.org/T387350#10960064 (10herron) 05Open→03Resolved Optimistically resolving as we've tuned the window for istio slos to 4w (from 12w) [17:40:52] urbanecm: Ok, then I'd say let's move forward? [17:40:57] sure! [17:40:59] !log urbanecm@deploy1003 migr, urbanecm: Continuing with sync [17:41:45] (03PS2) 10Jgiannelos: mobileapps: Use profiler script to spawn profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) [17:42:25] (03CR) 10CI reject: [V:04-1] mobileapps: Use profiler script to spawn profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [17:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:45:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:45:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED [17:46:30] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164969|Growth(enwiki): enable limiting Add a Link to new editors (T386034)]] (duration: 11m 46s) [17:46:34] MichaelG_WMF: should be deployed [17:46:36] T386034: Add a Link: Community Configuration setting to allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T386034 [17:47:10] urbanecm: thanks, I'll check without mwdebug! [17:48:09] sounds good! [17:48:55] Looks good, as far as I can tell [17:48:57] jhancock@cumin1003 provision (PID 3894577) is awaiting input [17:49:18] let's see how things develop and also with the upcoming train [17:51:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox [17:52:16] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10960126 (10RobH) @Jclark-ctr & @Jhancock.wm: Please note this was pinged in IRC as well, if either of you are on-site today/next, please address this issue. [17:52:36] MichaelG_WMF: i certainly hope train won't break :) [17:52:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:53:28] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10960129 (10Jhancock.wm) it's a backplane communication error. someone onsite needs to reseat the cables to the backplane. 90% chance that fixes it. [17:53:59] urbanecm: I don't expect it to. We tested this in beta and all we found were minor UI issues, nothing that would break a train. [17:54:34] Still, next time it would be nice to have it enabled in testwiki early. 
[17:55:33] jhancock@cumin1003 provision (PID 3894577) is awaiting input [17:55:47] !log Implement Varnish vmod_var-based X-Analytics formatting - T373550 [17:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:53] T373550: Move varnish pseudo-headers to vmod_var variables - https://phabricator.wikimedia.org/T373550 [17:58:58] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10960145 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:05RobH→03BCornwall [18:00:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:04:06] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10960165 (10Jhancock.wm) @jcrespo hey i finished testing on this server. Do you want to take it for a spin? it's the new 1CPU Config-K. (note, the re-image is going to come back as faile... [18:04:43] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10960181 (10Jhancock.wm) 05Open→03Resolved [18:05:41] !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore1005.eqiad.wmnet with OS bullseye [18:05:53] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10960191 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad... 
[18:05:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:09:31] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:12:13] (03PS4) 10BCornwall: varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) [18:15:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:17:54] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:18:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:18:56] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:19:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:19:46] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:19:57] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:23:20] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2010 to codfw - 
jhancock@cumin2002" [18:23:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2010 to codfw - jhancock@cumin2002" [18:23:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:18] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2010 [18:27:47] (03PS1) 10C. Scott Ananian: Disable ParserMigration indicator and user notice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) [18:28:21] (03PS5) 10BCornwall: varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) [18:28:21] jhancock@cumin2002 configure-switch-interfaces (PID 3476591) is awaiting input [18:28:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:38] (03CR) 10Daniuu: "Removed the deleted permissions for now. If needed, we can add them again in a later patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [18:30:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2010 [18:31:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:32:15] (03CR) 10Majavah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [18:32:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:33:09] (03CR) 10CI reject: [V:04-1] nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [18:33:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:34:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:35:03] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [18:35:11] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10960265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest20... 
[18:36:53] (03PS5) 10Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) [18:38:19] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 10MW-1.45-notes (1.45.0-wmf.8; 2025-07-01): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10960272 (10DLynch) @elukey Okay, this has made it to the train for this week, so we s... [18:38:30] jhancock@cumin2002 provision (PID 3479969) is awaiting input [18:40:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:41:10] (03PS6) 10Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) [18:41:30] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [18:41:59] !log eevans@cumin1003 START - Cookbook sre.discovery.service-route check sessionstore: maintenance [18:41:59] !log eevans@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check sessionstore: maintenance [18:42:08] (03CR) 10BCornwall: [V:03+2] "They are (after fixing a bad rebase):" [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:42:09] jhancock@cumin1003 provision (PID 3897463) is awaiting input [18:43:26] !log eevans@cumin1003 START - Cookbook sre.discovery.service-route check sessionstore: maintenance [18:43:26] !log eevans@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check sessionstore: maintenance [18:44:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability 
low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [18:46:29] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [18:46:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:47:33] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc2002.codfw.wmnet with reason: Decom [18:47:43] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:46] (03CR) 10AOkoth: [C:03+2] doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [18:48:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [18:50:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:50:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:50:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.199 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:51:10] !log aokoth@cumin1002 START - Cookbook sre.hosts.decommission for hosts doc2002.codfw.wmnet [18:51:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:55:20] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10960323 (10Eevans) >>! In T398225#10960125, @RobH wrote: > @Jclark-ctr & @Jhancock.wm: Please note this was pinged in IRC as well, if either of you are on-site today/next, please address this issue. Shou... [18:55:42] (03PS1) 10Jgiannelos: mobileapps: Use GET instead of POST for MW API requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165110 [18:55:53] !log aokoth@cumin1002 START - Cookbook sre.dns.netbox [18:55:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:59:17] FIRING: [2x] ProbeDown: Service doc1004.eqiad.wmnet:443 has failed probes (http_doc1004_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:24] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [19:00:33] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one 
- aokoth@cumin1002" [19:00:33] !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:33] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doc2002.codfw.wmnet [19:02:02] (03PS2) 10Jgiannelos: mobileapps: Use GET instead of POST for MW API requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) [19:02:51] (03CR) 10Urbanecm: [C:04-1] mobileapps: Use GET instead of POST for MW API requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) (owner: 10Jgiannelos) [19:03:39] (03CR) 10Urbanecm: [C:04-1] "curiosity question: how does this solve the issue? I'd like to understand the root cause here, if possible." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) (owner: 10Jgiannelos) [19:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:19:17] RESOLVED: [2x] ProbeDown: Service doc1004.eqiad.wmnet:443 has failed probes (http_doc1004_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:28:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [19:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:32] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [20:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T2000). [20:00:05] EggRoll97, tgr, MichaelG_WMF, cjming, and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [20:00:19] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm [20:00:26] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [20:00:31] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (... [20:01:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [20:01:25] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm [20:01:37] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [20:01:40] o/ [20:01:41] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (... [20:01:49] o/ [20:01:53] i can deploy for those not self-deploying [20:02:21] hi [20:02:28] my patch is safe to bundle with other stuff [20:02:30] Yes I have 2 things but im not self deploying [20:03:47] EggRoll97: are you here? 
[20:04:23] tgr: maybe I'll do yours and mine together? [20:04:40] MichaelG_WMF: how about you? [20:04:51] bwang: can all 3 of your go out together? [20:04:57] *yours [20:05:21] Its only 2 [20:05:23] But yes [20:05:52] cool - ok i'll start with mine and tgr's [20:06:19] (03PS2) 10Gergő Tisza: Revert "Add scrambled: password class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) [20:06:44] (03PS3) 10Clare Ming: Enable experiment configs fetching for group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) [20:07:18] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:08:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: 10Gergő Tisza) [20:08:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: 10Clare Ming) [20:09:05] (03Merged) 10jenkins-bot: Revert "Add scrambled: password class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: 10Gergő Tisza) [20:09:12] (03Merged) 10jenkins-bot: Enable experiment configs fetching for group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165060 (https://phabricator.wikimedia.org/T397144) (owner: 10Clare Ming) [20:09:30] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1159568|Revert "Add scrambled: password class" (T395360 T395372)]], [[gerrit:1165060|Enable experiment configs fetching for group 0 (T397144)]] [20:09:38] T395372: Handle scrambled password type in CentralAuth - 
https://phabricator.wikimedia.org/T395372 [20:09:38] T397144: MetricsPlatform: Enable experiment config fetching - https://phabricator.wikimedia.org/T397144 [20:11:27] !log cjming@deploy1003 cjming, tgr: Backport for [[gerrit:1159568|Revert "Add scrambled: password class" (T395360 T395372)]], [[gerrit:1165060|Enable experiment configs fetching for group 0 (T397144)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:11:59] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165150 [20:12:41] cjming: mine works [20:12:54] oh wait, forgot to use XWD [20:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:13:35] okay still works [20:13:40] nice - will sync [20:13:53] !log cjming@deploy1003 cjming, tgr: Continuing with sync [20:14:17] bwang: i'll do both of yours next [20:14:48] Thank you [20:19:27] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159568|Revert "Add scrambled: password class" (T395360 T395372)]], [[gerrit:1165060|Enable experiment configs fetching for group 0 (T397144)]] (duration: 09m 57s) [20:19:35] T395372: Handle scrambled password type in CentralAuth - https://phabricator.wikimedia.org/T395372 [20:19:35] T397144: MetricsPlatform: Enable experiment config fetching - https://phabricator.wikimedia.org/T397144 [20:20:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) (owner: 10Bernard Wang) [20:20:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: 10Bernard Wang) 
[20:21:17] (03Merged) 10jenkins-bot: Prevent extra scrolling when dialog is open on ios [skins/MinervaNeue] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164474 (https://phabricator.wikimedia.org/T397539) (owner: 10Bernard Wang) [20:26:04] Lmk when its ready to test! [20:26:39] will do - just waiting for your core backport to merge [20:27:04] (03PS1) 10Andrew Bogott: Openstack web proxy: allow 'proxyadmin' users to modify proxies [puppet] - 10https://gerrit.wikimedia.org/r/1165154 (https://phabricator.wikimedia.org/T273150) [20:27:05] (03PS1) 10Andrew Bogott: Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) [20:28:44] (03PS1) 10Scott French: aptrepo: add php83 component and pcre2 updates [puppet] - 10https://gerrit.wikimedia.org/r/1165151 (https://phabricator.wikimedia.org/T398245) [20:28:44] (03CR) 10Scott French: "Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1165151 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [20:28:45] (03PS2) 10Scott French: package_builder: add pbuilder hook for component/php83 [puppet] - 10https://gerrit.wikimedia.org/r/1165152 (https://phabricator.wikimedia.org/T398245) [20:31:01] (03CR) 10Andrew Bogott: [C:03+1] P:wmcs: ntp: Automatically restart the service after config changes [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: 10Majavah) [20:32:28] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960902 (10Jhancock.wm) [20:33:16] (03Merged) 10jenkins-bot: Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164475 (https://phabricator.wikimedia.org/T397469) (owner: 10Bernard Wang) [20:33:27] !log 
jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:33:34] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1164474|Prevent extra scrolling when dialog is open on ios (T397539)]], [[gerrit:1164475|Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened (T397469)]] [20:33:41] T397539: Fix background scrolling on new mobile search experience - https://phabricator.wikimedia.org/T397539 [20:33:41] T397469: Remove extra tap when opening search bar on minerva - https://phabricator.wikimedia.org/T397469 [20:35:28] !log cjming@deploy1003 cjming, bwang: Backport for [[gerrit:1164474|Prevent extra scrolling when dialog is open on ios (T397539)]], [[gerrit:1164475|Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened (T397469)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:35:34] bwang: on test servers if you want to check ^^ [20:36:18] lmk if/when to sync [20:38:44] Ok done! [20:38:51] Go ahead [20:38:51] cool ! ok to sync? 
[20:38:54] nice [20:39:02] !log cjming@deploy1003 cjming, bwang: Continuing with sync [20:41:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:41:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:42:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:42:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:44:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:44:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [20:44:32] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960932 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm [20:44:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:44:57] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164474|Prevent extra scrolling when dialog is open on ios (T397539)]], [[gerrit:1164475|Add workaround for iOS to ensure the virtual keyboard is opened when the mobile TAHS overlay is opened (T397469)]] (duration: 11m 23s) [20:45:02] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [20:45:04] T397539: Fix background scrolling on new mobile search experience - 
https://phabricator.wikimedia.org/T397539 [20:45:04] T397469: Remove extra tap when opening search bar on minerva - https://phabricator.wikimedia.org/T397469 [20:45:08] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (... [20:46:01] EggRoll97 + MichaelG_WMF: if you're around and want to self-deploy, please go ahed -- if you need a deployer, please ping me and I can deploy for you [20:46:10] *ahead [20:46:31] I'll leave the backport window open for a few minutes longer [20:48:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [20:49:04] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2006.codfw.wmnet with OS bookworm [20:49:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [20:49:22] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10960968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (... [21:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [21:00:05] Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T2100). 
[21:05:31] (03PS1) 10Eevans: fix cookbook names in example text [cookbooks] - 10https://gerrit.wikimedia.org/r/1165161 [21:19:58] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10961058 (10phaultfinder) [21:23:15] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10961061 (10Jhancock.wm) @elukey trying to figure out why this reimage script isn't working on this test server. it has a raid and a boss card. the boss card has a raid 1 between the two... [21:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:36:24] (03CR) 10Scott French: [C:03+1] fix cookbook names in example text [cookbooks] - 10https://gerrit.wikimedia.org/r/1165161 (owner: 10Eevans) [21:37:47] (03CR) 10Btullis: [V:03+1 C:03+2] Ensure that master=yarn is the default spark configuration for users [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis) [21:39:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.3% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:54:45] FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: ... 
[21:54:45] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [21:59:45] RESOLVED: CirrusStreamingUpdaterFlinkNoRegisteredTask: ... [21:59:45] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [22:03:03] (03CR) 10Ladsgroup: "Thank you and sorry for such a dumb mistake" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164984 (owner: 10Gmodena) [22:05:23] (03PS1) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 [22:06:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10961146 (10Htriedman) Hi @Clement_Goubert! When I navigate to the L3 document page, there's no option to sign again — any way I... 
[22:15:28] (03PS2) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 [22:16:27] (03CR) 10CI reject: [V:04-1] Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [22:18:40] (03PS3) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 [22:21:09] (03PS4) 10Cwhite: logstash: filter_on_template_v2 fixes [puppet] - 10https://gerrit.wikimedia.org/r/1163486 (https://phabricator.wikimedia.org/T234565) [22:21:38] (03PS3) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565) [22:24:53] (03PS4) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565) [22:26:23] (03CR) 10Cwhite: [C:03+2] logstash: filter_on_template_v2 fixes [puppet] - 10https://gerrit.wikimedia.org/r/1163486 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:28:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:38:43] (03PS1) 10Cwhite: logstash: re-enable filter_on_template_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1165173 [22:40:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on 
maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:44:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:45:42] (03CR) 10Cwhite: [C:03+2] logstash: re-enable filter_on_template_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1165173 (owner: 10Cwhite) [22:46:14] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 331 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:47:14] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29767 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:47:42] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:28] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [22:59:32] 
10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10961285 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host sretest2006.codfw.wmnet with OS bookworm [22:59:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [22:59:46] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10961286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (**... [23:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T2300) [23:02:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:09:59] (03PS1) 10Tim Starling: uppercaseTitlesForUnicodeTransition: Add file table [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165179 (https://phabricator.wikimedia.org/T383496) [23:15:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:17:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:29:25] "web team does not typically check IRC so assume this is not being used if 5 minutes past the start" [23:29:29] classy [23:29:51] (03CR) 10Tim Starling: [C:03+2] uppercaseTitlesForUnicodeTransition: Add file table [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165179 (https://phabricator.wikimedia.org/T383496) (owner: 10Tim Starling) [23:30:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:33:44] (03Merged) 10jenkins-bot: uppercaseTitlesForUnicodeTransition: Add file table [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165179 (https://phabricator.wikimedia.org/T383496) (owner: 10Tim Starling) [23:34:42] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1165179|uppercaseTitlesForUnicodeTransition: Add file table (T383496)]] [23:34:47] T383496: Add support for reading new file schema into MediaWiki - https://phabricator.wikimedia.org/T383496 [23:36:41] !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1165179|uppercaseTitlesForUnicodeTransition: Add file table (T383496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:38:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1165180 [23:38:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1165180 (owner: 10TrainBranchBot) [23:38:47] !log tstarling@deploy1003 tstarling: Continuing with sync [23:42:59] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10961353 (10Ladsgroup) ` root@ms-fe1009:~# swift stat --lh wikipedia-commons-local-thumb.13 Account: AUTH_mw Container: wikipedia-commons-local-thumb.13... [23:43:47] (03PS1) 10Clare Ming: xLab: Deploy v0.7.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165181 (https://phabricator.wikimedia.org/T396151) [23:44:16] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165179|uppercaseTitlesForUnicodeTransition: Add file table (T383496)]] (duration: 09m 34s) [23:44:22] T383496: Add support for reading new file schema into MediaWiki - https://phabricator.wikimedia.org/T383496 [23:45:20] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165181 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming) [23:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:47:00] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165181 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming) [23:47:27] (03PS1) 10Clare Ming: xLab: Deploy v0.7.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165182 (https://phabricator.wikimedia.org/T396151) [23:48:32] (03CR) 10Santiago Faci: 
[C:03+2] xLab: Deploy v0.7.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165182 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming) [23:49:53] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [23:50:07] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165182 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming) [23:50:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1165180 (owner: 10TrainBranchBot) [23:50:52] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [23:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:36] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [23:59:54] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply