[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0000) [00:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1120254 [00:38:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1120254 (owner: 10TrainBranchBot) [00:48:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1120254 (owner: 10TrainBranchBot) [01:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1120255 [01:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1120255 (owner: 10TrainBranchBot) [01:28:31] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1120255 (owner: 10TrainBranchBot) [01:46:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/313a7cb6100300c73abcb7c73553167541bfbf58659551c83ecf3b830fa53fc9/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:06:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:08:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.17 [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120259 (https://phabricator.wikimedia.org/T382368) [02:08:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.17 [core] (wmf/1.44.0-wmf.17) 
- 10https://gerrit.wikimedia.org/r/1120259 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [02:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:20:23] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.17 [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120259 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0300) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:35:24] (03CR) 10Pppery: [C:03+1] E_STRICT PHP constant deprecated since PHP 8.4 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120155 (owner: 10Aklapper) [04:00:11] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0400) [04:01:44] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120262 (https://phabricator.wikimedia.org/T382368) [04:01:45] 
(03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120262 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [04:02:37] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120262 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [04:03:04] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.17 refs T382368 [04:03:07] T382368: 1.44.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T382368 [04:51:25] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.17 refs T382368 (duration: 48m 21s) [04:51:29] T382368: 1.44.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T382368 [05:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0500) [05:02:58] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.14 (duration: 02m 56s) [06:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0700). 
[07:08:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10558401 (10MoritzMuehlenhoff) [07:08:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [07:09:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10558402 (10ops-monitoring-bot) Draining ganeti1023.eqiad.wmnet of running VMs [07:11:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [07:18:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [07:18:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10558432 (10ops-monitoring-bot) Draining ganeti1023.eqiad.wmnet of running VMs [07:19:15] (03PS1) 10Muehlenhoff: Switch ganeti1023 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1120406 [07:27:29] (03CR) 10Awight: [C:03+1] [beta] Change sub-referencing feature flag to new name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120217 (https://phabricator.wikimedia.org/T373307) (owner: 10WMDE-Fisch) [07:46:00] (03PS1) 10Muehlenhoff: Fix Cumin aliases for LVSes with ongoing Liberica migration [puppet] - 10https://gerrit.wikimedia.org/r/1120461 (https://phabricator.wikimedia.org/T384477) [07:48:53] (03PS1) 10JMeybohm: Add second pair of kubeconfig files for restricted users [puppet] - 10https://gerrit.wikimedia.org/r/1120462 (https://phabricator.wikimedia.org/T378429) [07:48:56] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] [beta] Change sub-referencing feature flag to new name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120217 (https://phabricator.wikimedia.org/T373307) (owner: 10WMDE-Fisch) [08:00:05] Amir1, Urbanecm, and 
awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:03:23] (03PS2) 10JMeybohm: Add second pair of kubeconfig files for restricted users [puppet] - 10https://gerrit.wikimedia.org/r/1120462 (https://phabricator.wikimedia.org/T378429) [08:03:23] (03PS1) 10JMeybohm: pki::get_cert: Allow to get the same cert twice [puppet] - 10https://gerrit.wikimedia.org/r/1120464 (https://phabricator.wikimedia.org/T378429) [08:06:42] (03PS2) 10JMeybohm: pki::get_cert: Allow to get the same cert twice [puppet] - 10https://gerrit.wikimedia.org/r/1120464 (https://phabricator.wikimedia.org/T378429) [08:06:42] (03PS3) 10JMeybohm: Add second pair of kubeconfig files for restricted users [puppet] - 10https://gerrit.wikimedia.org/r/1120462 (https://phabricator.wikimedia.org/T378429) [08:16:13] (03PS1) 10Awight: [beta] Enable Community Configuration for Cite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T378807) [08:19:13] (03PS1) 10Filippo Giunchedi: pontoon: add compat symlink at enroll time [puppet] - 10https://gerrit.wikimedia.org/r/1120467 [08:19:14] (03PS1) 10Filippo Giunchedi: pontoon: clarify failed push instructions [puppet] - 10https://gerrit.wikimedia.org/r/1120468 [08:20:16] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: clarify failed push instructions [puppet] - 10https://gerrit.wikimedia.org/r/1120468 (owner: 10Filippo Giunchedi) [08:20:18] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add compat symlink at enroll time [puppet] - 10https://gerrit.wikimedia.org/r/1120467 (owner: 10Filippo Giunchedi) [08:26:37] (03PS1) 10Jon Harald Søby: Rename global variable from the WikimediaIncubator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119778 [08:27:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment 
in the [Tuesday, February 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119778 (owner: 10Jon Harald Søby) [08:28:40] Amir1, urbanecm, would it be okay to do this ^ deployment now, since there's technically 30 minutes left of the current window? 😅 [08:28:46] jhathaway: hey, sure! [08:28:51] ... [08:28:52] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120469 [08:28:55] :D [08:29:18] sorry for the ping [08:29:38] Jhs: for some reason, tab completion really refuses to ping you [08:29:52] Jhs: just double checking, you want https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1119778 live, right? [08:30:28] urbanecm, correct [08:31:07] I should rename myself to Jh0s or something, just to save hathaway from unintentional pings :D [08:31:10] (03PS2) 10Jon Harald Søby: Rename global variable from the WikimediaIncubator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119778 [08:31:12] (03CR) 10Urbanecm: [C:03+2] Rename global variable from the WikimediaIncubator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119778 (owner: 10Jon Harald Søby) [08:31:43] Jhs: that wouldn't really help, as i typed jh. and for some reason, that offers just one match [08:31:58] (03Merged) 10jenkins-bot: Rename global variable from the WikimediaIncubator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119778 (owner: 10Jon Harald Søby) [08:32:03] hmm, reload has helped [08:32:15] urbanecm, ah, case-sensitive tab completion maybe? 
[08:32:24] i tried both versions [08:32:46] anyway, a quirk [08:33:19] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1119778|Rename global variable from the WikimediaIncubator extension]] [08:36:04] (03CR) 10Vgutierrez: Fix Cumin aliases for LVSes with ongoing Liberica migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120461 (https://phabricator.wikimedia.org/T384477) (owner: 10Muehlenhoff) [08:39:28] !log urbanecm@deploy2002 urbanecm, jhsoby: Backport for [[gerrit:1119778|Rename global variable from the WikimediaIncubator extension]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:39:39] Jhs: can you test? [08:41:04] PROBLEM - Host dse-k8s-worker1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:32] RECOVERY - Host dse-k8s-worker1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [08:42:41] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] [beta] Enable Community Configuration for Cite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T378807) (owner: 10Awight) [08:42:55] 👀 [08:44:04] Jhs: how is the test going? [08:45:43] urbanecm, lemmesee [08:47:32] (03CR) 10Elukey: "Looks good, can you expand a little why this is needed? 
Usually duplicate declarations highlight some puppet-coding race condition, I just" [puppet] - 10https://gerrit.wikimedia.org/r/1120464 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [08:48:30] urbanecm, things still work like they should [08:48:33] !log urbanecm@deploy2002 urbanecm, jhsoby: Continuing with sync [08:48:37] awesome, proceeding [08:49:00] (it's very difficult to find pages where this variable actually comes into play… I can explain tomorrow if I remember and if you care… :P) [08:49:34] (03CR) 10Elukey: "My brain is missing why removing the labels and match selector helps with what written in the commit msg, can you expand a bit so your dea" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [08:50:31] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:50:42] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:51:26] (03PS2) 10Muehlenhoff: Fix Cumin aliases for LVSes with ongoing Liberica migration [puppet] - 10https://gerrit.wikimedia.org/r/1120461 (https://phabricator.wikimedia.org/T384477) [08:51:35] (03CR) 10Muehlenhoff: Fix Cumin aliases for LVSes with ongoing Liberica migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120461 (https://phabricator.wikimedia.org/T384477) (owner: 10Muehlenhoff) [08:51:35] sounds good :) [08:52:31] (03CR) 10Vgutierrez: [C:03+1] "looks good, thanks for fixing this <3" [puppet] - 10https://gerrit.wikimedia.org/r/1120461 (https://phabricator.wikimedia.org/T384477) (owner: 10Muehlenhoff) [08:57:51] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119778|Rename global variable from the WikimediaIncubator extension]] (duration: 24m 32s) [08:58:08] Jhs: should be live [08:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 22.22% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:58:18] (btw, the regaining apparently confuses irccloud's tab completion) [08:58:19] uhoh [09:00:57] Jhs: anything else? [09:01:24] nope! thanks :D [09:01:45] (as long as the uhoh wasn't something I did :P) [09:02:51] nope, at the not enough idle workers warning [09:03:00] (03CR) 10Muehlenhoff: [C:03+2] Fix Cumin aliases for LVSes with ongoing Liberica migration [puppet] - 10https://gerrit.wikimedia.org/r/1120461 (https://phabricator.wikimedia.org/T384477) (owner: 10Muehlenhoff) [09:03:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:07:00] (03PS2) 10Tiziano Fogli: grafana: failover [puppet] - 10https://gerrit.wikimedia.org/r/1120476 (https://phabricator.wikimedia.org/T385282) [09:07:01] (03CR) 10Tiziano Fogli: "To be merged after performing manual actions via Cumin on Grafana hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1120476 (https://phabricator.wikimedia.org/T385282) (owner: 10Tiziano Fogli) [09:10:18] (03CR) 10Kamila Součková: [C:03+1] wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [09:12:33] (03PS1) 10Tiziano Fogli: grafana: failover [dns] - 10https://gerrit.wikimedia.org/r/1120483 (https://phabricator.wikimedia.org/T385282) [09:12:33] (03CR) 10Tiziano Fogli: "To be merged after performing manual actions via Cumin on Grafana hosts" [dns] - 10https://gerrit.wikimedia.org/r/1120483 (https://phabricator.wikimedia.org/T385282) (owner: 10Tiziano Fogli) [09:21:18] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558739 (10elukey) In the cumin2002 logs I see: ` 2025-02-07 11:40:17,558 jmm 2595123 [DEBUG redfish.py:912 in generation] ganeti1033: iD... [09:24:35] (03CR) 10Urbanecm: [C:04-1] "Yes! 
You'd have the processing in CS.php (which is evaluated for labs as well), the production experiment would be enabled in IS.php; for " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [09:26:34] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [09:27:01] (03CR) 10Filippo Giunchedi: [C:03+1] grafana: failover [dns] - 10https://gerrit.wikimedia.org/r/1120483 (https://phabricator.wikimedia.org/T385282) (owner: 10Tiziano Fogli) [09:27:05] (03CR) 10Filippo Giunchedi: [C:03+1] grafana: failover [puppet] - 10https://gerrit.wikimedia.org/r/1120476 (https://phabricator.wikimedia.org/T385282) (owner: 10Tiziano Fogli) [09:32:34] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1148-1153].eqiad.wmnet [09:32:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1148-1153].eqiad.wmnet [09:32:48] 07sre-alert-triage, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T382871#10558770 (10brouberol) 05Open→03Resolved a:03brouberol The alert seems to have resolved. I can't see any active al... 
[09:33:39] (03PS1) 10Vgutierrez: hiera: Unify realserver::ipip mss values [puppet] - 10https://gerrit.wikimedia.org/r/1120488 [09:35:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120488 (owner: 10Vgutierrez) [09:35:46] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ganeti1033.eqiad.wmnet [09:36:23] (03CR) 10Urbanecm: [C:04-1] [Growth] Set default api lookahead size to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [09:42:43] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1123.eqiad.wmnet [09:42:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1123.eqiad.wmnet [09:42:50] !log performing grafana failover (grafana2001 is becoming the new active host) T385282 [09:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:54] T385282: Disk space on grafana2001 is low - https://phabricator.wikimedia.org/T385282 [09:47:40] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [09:51:06] (03CR) 10Tiziano Fogli: [C:03+2] grafana: failover [puppet] - 10https://gerrit.wikimedia.org/r/1120476 (https://phabricator.wikimedia.org/T385282) (owner: 10Tiziano Fogli) [09:53:23] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti1033.eqiad.wmnet [09:53:26] (03CR) 10Tiziano Fogli: [C:03+2] grafana: failover [dns] - 10https://gerrit.wikimedia.org/r/1120483 (https://phabricator.wikimedia.org/T385282) (owner: 10Tiziano Fogli) [09:55:49] !log tappof@dns1004 START - running authdns-update [09:56:47] (03CR) 10Vgutierrez: [C:03+2] hiera: Unify realserver::ipip mss values [puppet] - 
10https://gerrit.wikimedia.org/r/1120488 (owner: 10Vgutierrez) [09:56:52] (03PS1) 10Lucas Werkmeister (WMDE): Rename the `tmpEnableMulLanguageCode` flag to `enableMulLanguageCode` [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120494 (https://phabricator.wikimedia.org/T330217) [09:56:53] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558854 (10elukey) All right installed `python3.9-dbg` on cumin2002, and ran the cookbook and used `py-bt` to verify where it hangs: ` (g... [09:57:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120494 (https://phabricator.wikimedia.org/T330217) (owner: 10Lucas Werkmeister (WMDE)) [09:57:57] !log tappof@dns1004 END - running authdns-update [09:58:26] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:58:29] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:58:52] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:58:53] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:58:58] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:59:06] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:06:58] (03PS1) 10Vgutierrez: hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) [10:07:06] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:07:22] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:07:31] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:07:55] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:08:19] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:08:26] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:09:29] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:09:36] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:09:43] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:10:23] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558896 (10elukey) I confirm that with spicerack-shell I can see the following hanging: ` >>> pprint(r.upload_file(Path("/srv/firmware/po... [10:10:31] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:10:41] (03PS3) 10Brouberol: flink-k8s-operator: publish version 1.10.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) [10:11:23] (03CR) 10Brouberol: flink-k8s-operator: publish version 1.10.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) (owner: 10Brouberol) [10:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [10:14:18] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:14:25] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:14:41] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [10:21:25] (03PS2) 10Vgutierrez: hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) [10:22:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [10:22:38] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558933 (10MoritzMuehlenhoff) >>! In T385873#10558896, @elukey wrote: > I have no idea how long it takes for the BMC to fetch ~200MB of da... 
[10:24:03] !log installing libpgjava security updates [10:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:59] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:36:42] (03CR) 10DCausse: flink-k8s-operator: publish version 1.10.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) (owner: 10Brouberol) [10:36:55] (03PS2) 10Sergio Gimeno: beta: A/B test setup for surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) [10:37:36] (03PS2) 10Esanders: Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) [10:40:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [10:40:48] (03PS3) 10Esanders: Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) [10:41:19] (03CR) 10David Caro: "LGTM, @aborrero@wikimedia.org can you give it a look also just in case?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:43:28] (03PS3) 10Sergio Gimeno: cswiki beta: A/B test setup for surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) [10:43:28] (03PS1) 10Sergio Gimeno: [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) [10:43:29] jouncebot: nowandnext [10:43:29] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [10:43:29] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1100) [10:43:43] (03CR) 10Sergio Gimeno: [C:04-1] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [10:44:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120149 (owner: 10Majavah) [10:45:03] (03Merged) 10jenkins-bot: wikitech: Unset $wgEnableCreativeCommonsRdf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120149 (owner: 10Majavah) [10:45:30] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1120149|wikitech: Unset $wgEnableCreativeCommonsRdf]] [10:45:31] (03CR) 10Sergio Gimeno: "Gotcha, ty! 
I restricted it to cswiki per slack thread https://wikimedia.slack.com/archives/G0101329ZC7/p1739816290355579" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [10:45:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [10:47:00] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10559007 (10A_smart_kitten) Tentatively adding to the #SRE-swift-storage queue in case they can determine what went wrong here... [10:47:45] (03PS4) 10Sergio Gimeno: cswiki beta: A/B test setup for surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) [10:47:45] (03PS2) 10Sergio Gimeno: [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) [10:50:22] !log tappof@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on grafana1002.eqiad.wmnet with reason: expand the root partition and fs on grafana1002 [10:50:35] (03PS1) 10Elukey: profile::cumin::master: add python3-dbg [puppet] - 10https://gerrit.wikimedia.org/r/1120506 [10:51:29] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10559022 (10MoritzMuehlenhoff) Seems I was just too impatient (or unaware how slow it can be for some firmwares), it completed after roughl... 
[10:51:29] !log taavi@deploy2002 taavi: Backport for [[gerrit:1120149|wikitech: Unset $wgEnableCreativeCommonsRdf]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:51:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [10:51:45] !log taavi@deploy2002 taavi: Continuing with sync [10:54:07] (03CR) 10Volans: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1120506 (owner: 10Elukey) [10:57:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:58:40] (03PS2) 10Elukey: profile::cumin::master: add python3-dbg [puppet] - 10https://gerrit.wikimedia.org/r/1120506 [10:59:27] (03PS2) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 [10:59:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120506 (owner: 10Elukey) [11:00:04] (03PS3) 10Elukey: profile::cumin::{cloud_master,master}: add python3-dbg [puppet] - 10https://gerrit.wikimedia.org/r/1120506 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1100) [11:00:30] (03CR) 10Elukey: "Sure, added David and Arturo in the loop :)" [puppet] - 10https://gerrit.wikimedia.org/r/1120506 (owner: 10Elukey) [11:00:40] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120506 (owner: 10Elukey) [11:01:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [11:01:15] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120149|wikitech: Unset $wgEnableCreativeCommonsRdf]] (duration: 15m 45s) [11:01:15] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:01:40] (03CR) 10Brouberol: flink-k8s-operator: publish version 1.10.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) (owner: 10Brouberol) [11:02:42] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:29] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1120506 (owner: 10Elukey) [11:05:40] (03CR) 10Elukey: [C:03+2] profile::cumin::{cloud_master,master}: add python3-dbg [puppet] - 10https://gerrit.wikimedia.org/r/1120506 (owner: 10Elukey) [11:07:15] (03PS4) 10Brouberol: flink-k8s-operator: publish version 1.10.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) [11:15:14] (03PS4) 10Majavah: Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) [11:15:14] (03PS1) 10Majavah: wikitech: Remove useless conditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120509 [11:40:04] (03CR) 10Kamila Součková: [C:03+1] Add second pair of kubeconfig files for restricted users [puppet] - 10https://gerrit.wikimedia.org/r/1120462 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [11:45:36] (03CR) 10Ladsgroup: [C:03+1] wikitech: Remove useless conditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120509 (owner: 10Majavah) [11:50:37] (03PS5) 10Majavah: Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T377074) [12:01:20] !log installing openjdk-17 
security updates [12:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:32] FIRING: Wikidata Reliability Metrics - Median Payload alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [12:12:39] FIRING: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:13:13] lots of wikidata alerts are firing at the same time [12:13:43] but on https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=30s everything looks fine o_O [12:14:55] looks like a false alarm to me, though I don’t know why [12:15:12] let’s see if any “resolved” emails come in, I guess [12:17:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:27:12] (03CR) 10JMeybohm: "Removing the labels from the selector does make it match more Pods, not only the one running the controller (which was the only pod exposi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:32:05] (03CR) 10JMeybohm: "My very rude pcc run "completed" at https://puppet-compiler.wmflabs.org/output/1120464/4946/" [puppet] - 10https://gerrit.wikimedia.org/r/1120464 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [12:32:32] RESOLVED: Wikidata Reliability Metrics - Median Payload alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [12:32:39] RESOLVED: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:36:38] (03PS2) 10Thiemo Kreuz (WMDE): [beta] Enable Community Configuration for Cite 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T386706) (owner: 10Awight) [12:37:14] PROBLEM - Disk space on grafana1002 is CRITICAL: DISK CRITICAL - free space: / 634MiB (3% inode=43%): /tmp 634MiB (3% inode=43%): /var/tmp 634MiB (3% inode=43%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana1002&var-datasource=eqiad+prometheus/ops [12:37:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:40:32] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1033.eqiad.wmnet with reason: remove from cluster for reimage [12:40:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10559276 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b44e5f88-5e48-499b-b781-f104874dd4e9) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [12:41:14] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1023.eqiad.wmnet with reason: remove from cluster for reimage [12:41:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10559284 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5719e915-71ee-41e3-83de-61bb417448f0) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[12:44:52] (03PS1) 10Aklapper: Move admin check from isFriendlyUser() to PhabricatorPeopleQuery [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120530 (https://phabricator.wikimedia.org/T386704) [12:45:40] (03CR) 10Urbanecm: [C:03+1] "looks good (for beta purposes at least), but logged a question inline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [12:46:32] (03PS2) 10Aklapper: Move admin check from isFriendlyUser() to PhabricatorPeopleQuery [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120530 (https://phabricator.wikimedia.org/T386704) [12:53:30] (03PS1) 10Gmodena: Revert "cirrus: enable mlr-2025 for select wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120534 [13:00:07] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1300) [13:03:05] (03PS1) 10Muehlenhoff: Bump versions of Java 11/17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 [13:04:17] jouncebot: nowandnext [13:04:18] For the next 0 hour(s) and 55 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1300) [13:04:18] In 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1400) [13:04:39] (03CR) 10Majavah: [C:03+2] wikitech: Remove useless conditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120509 (owner: 10Majavah) [13:04:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3587 MB (3% inode=98%): /tmp 3587 MB (3% inode=98%): /var/tmp 3587 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:04:58] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120509 (owner: 10Majavah) [13:05:49] (03Merged) 10jenkins-bot: wikitech: Remove useless conditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120509 (owner: 10Majavah) [13:06:16] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1120509|wikitech: Remove useless conditional]] [13:11:07] !log taavi@deploy2002 taavi: Backport for [[gerrit:1120509|wikitech: Remove useless conditional]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:44] !log taavi@deploy2002 taavi: Continuing with sync [13:13:24] (03PS2) 10Andrew Bogott: cloud-vps resolv.conf: remove .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) [13:14:26] (03CR) 10Hnowlan: [C:03+1] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120469 (owner: 10PipelineBot) [13:14:57] (03CR) 10Urbanecm: "I'd suggest either migrating all of the fixLinkRecommendationData-dryrun jobs together, or picking a different one (maybe `growthexperimen" [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [13:15:26] !log installing openjdk-11 security updates [13:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:28] (03CR) 10Urbanecm: mediawiki: Migrate one dry-run job to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [13:18:31] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120509|wikitech: Remove useless conditional]] (duration: 12m 15s) [13:18:35] (03CR) 10DCausse: [C:03+1] "thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) (owner: 10Brouberol) [13:19:16] (03PS1) 10Andrew Bogott: cloudgw1003: replace cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1120548 (https://phabricator.wikimedia.org/T382356) [13:21:35] FIRING: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:21:38] FIRING: Wikidata Reliability Metrics - Median Payload alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [13:21:39] (03Abandoned) 10Andrew Bogott: cloudgw1003: replace cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1120548 (https://phabricator.wikimedia.org/T382356) (owner: 10Andrew Bogott) [13:21:44] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Java sec updates - jmm@cumin2002 [13:24:06] (03PS6) 10Andrew Bogott: cloudgw1003: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [13:24:11] (03CR) 10Brouberol: [C:03+2] flink-k8s-operator: publish version 1.10.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) (owner: 10Brouberol) [13:24:14] (03CR) 10Brouberol: [V:03+2 C:03+2] flink-k8s-operator: publish version 1.10.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120489 (https://phabricator.wikimedia.org/T377137) (owner: 10Brouberol) [13:25:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [13:26:34] FIRING: Wikidata Reliability Metrics - Median loading time alert: - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:31:21] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps resolv.conf: remove .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [13:39:26] (03CR) 10Andrew Bogott: [C:03+2] sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [13:41:34] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:41:35] RESOLVED: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:41:38] RESOLVED: Wikidata Reliability Metrics - Median Payload alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [13:42:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Java sec updates - jmm@cumin2002 [13:44:56] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] cloudgw1003: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [13:45:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [13:56:02] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1023.eqiad.wmnet [13:56:39] (03CR) 10Sergio Gimeno: cswiki beta: A/B test setup for surfacing structured tasks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [13:57:14] PROBLEM - Disk space 
on grafana1002 is CRITICAL: DISK CRITICAL - free space: / 611MiB (3% inode=43%): /tmp 611MiB (3% inode=43%): /var/tmp 611MiB (3% inode=43%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana1002&var-datasource=eqiad+prometheus/ops [13:59:53] (03CR) 10Elukey: [C:03+1] "ahhh right right, makes sense!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120193 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1400). [14:00:05] Lucas_WMDE, edsanders, and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:22] Hello [14:01:10] here [14:01:13] (03CR) 10Elukey: [C:03+1] pki::get_cert: Allow to get the same cert twice [puppet] - 10https://gerrit.wikimedia.org/r/1120464 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [14:01:53] hey. i can deploy. [14:02:04] !log tappof@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on grafana1002.eqiad.wmnet with reason: expand the root partition and fs on grafana1002 [14:02:09] (03CR) 10Urbanecm: [C:03+2] cswiki beta: A/B test setup for surfacing structured tasks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [14:02:12] (03CR) 10Elukey: "One nit and you are good to go!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 (owner: 10Muehlenhoff) [14:02:34] (03PS4) 10Esanders: Deploy DiscussionTools visual enhancements to top 10 wikis (exc. 
enwiki, ruwiki & zhwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) [14:02:41] (03CR) 10Urbanecm: [C:03+2] Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [14:02:58] (03Merged) 10jenkins-bot: cswiki beta: A/B test setup for surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno) [14:03:25] urbanecm: thanks [14:03:37] np [14:03:49] (03Merged) 10jenkins-bot: Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [14:05:14] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1087539|Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) (T379102)]], [[gerrit:1119537|cswiki beta: A/B test setup for surfacing structured tasks (T385903)]] [14:05:19] T379102: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T379102 [14:05:19] T385903: Surfacing "Add a link" Structured Tasks: Set up A/B Test - https://phabricator.wikimedia.org/T385903 [14:09:16] o/ [14:09:20] sorry I’m late [14:09:58] !log urbanecm@deploy2002 urbanecm, esanders, sgimeno: Backport for [[gerrit:1087539|Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) (T379102)]], [[gerrit:1119537|cswiki beta: A/B test setup for surfacing structured tasks (T385903)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:11] edsanders: can you test? 
[14:10:20] sergi0: fyi, but i think yours is a no-op for prod [14:10:22] testing [14:10:29] (03CR) 10Urbanecm: [C:03+2] Rename the `tmpEnableMulLanguageCode` flag to `enableMulLanguageCode` [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120494 (https://phabricator.wikimedia.org/T330217) (owner: 10Lucas Werkmeister (WMDE)) [14:10:44] Lucas_WMDE: no worries, i started already. i've +2'ed your backport, and i'll hand it over to you once i finish the config patches? [14:10:51] sounds good, thanks! [14:11:00] urbanecm: it's no-op indeed, but, does the change in CS though mean I need to check nothing breaks? [14:11:16] sergi0: in theory, but there's no targeted check to do [14:11:22] feel free to do any tests you might have, i'll wait [14:11:25] (03PS2) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 [14:11:53] (03PS1) 10Urbanecm: growthexperiments.pp: Mark unnecessary jobs as absent [puppet] - 10https://gerrit.wikimedia.org/r/1120556 (https://phabricator.wikimedia.org/T385782) [14:11:56] (03PS1) 10Urbanecm: growthexperiments.pp: Drop absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/1120557 (https://phabricator.wikimedia.org/T385782) [14:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1033.eqiad.wmnet with OS bookworm [14:12:57] urbanecm: nothing to test, it's fine [14:13:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10559726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1033.eqiad.wmnet with OS 
bookworm [14:13:01] sounds good [14:13:06] waiting for ed [14:14:33] urbanecm: lgtm [14:14:33] (03PS2) 10Tobias Gritschacher: [beta] Change sub-referencing feature flag to new name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120217 (https://phabricator.wikimedia.org/T373307) (owner: 10WMDE-Fisch) [14:14:38] !log urbanecm@deploy2002 urbanecm, esanders, sgimeno: Continuing with sync [14:14:43] proceeding, ty [14:15:32] (03CR) 10Urbanecm: [C:04-1] "The Growth team decided to drop this job (and the few others), see I2068f22ab55cec7ecad9462a8396e36ddc2c6642. See task for my suggestions " [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [14:15:57] (03CR) 10Kamila Součková: [C:03+1] growthexperiments.pp: Mark unnecessary jobs as absent [puppet] - 10https://gerrit.wikimedia.org/r/1120556 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [14:16:52] (03PS3) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 [14:17:08] (03CR) 10Michael Große: [C:03+1] "Removing them is fine, the higher frequency that these jobs provided is no longer needed." [puppet] - 10https://gerrit.wikimedia.org/r/1120557 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [14:18:12] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4948/co" [puppet] - 10https://gerrit.wikimedia.org/r/1120556 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [14:21:44] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087539|Deploy DiscussionTools visual enhancements to top 10 wikis (exc. 
enwiki, ruwiki & zhwiki) (T379102)]], [[gerrit:1119537|cswiki beta: A/B test setup for surfacing structured tasks (T385903)]] (duration: 16m 29s) [14:21:48] T379102: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T379102 [14:21:49] T385903: Surfacing "Add a link" Structured Tasks: Set up A/B Test - https://phabricator.wikimedia.org/T385903 [14:21:50] edsanders: sergi0: synced [14:21:57] Lucas_WMDE: over to you! [14:21:57] thanks! [14:22:02] thanks! [14:22:03] Thank you! [14:22:10] np [14:22:13] (03Merged) 10jenkins-bot: Rename the `tmpEnableMulLanguageCode` flag to `enableMulLanguageCode` [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120494 (https://phabricator.wikimedia.org/T330217) (owner: 10Lucas Werkmeister (WMDE)) [14:22:21] perfect timing :D [14:22:46] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1120494|Rename the `tmpEnableMulLanguageCode` flag to `enableMulLanguageCode` (T330217)]] [14:22:49] T330217: MUL - Cleanup soft rollout flag - https://phabricator.wikimedia.org/T330217 [14:23:03] yep! 
[14:23:20] (03CR) 10Michael Große: [C:03+1] "Looks good from the Growth perspective too, we do not need the high-frequency data from these jobs anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1120556 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [14:24:57] (03CR) 10Kamila Součková: [C:03+2] benthos: update chart's modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115885 (https://phabricator.wikimedia.org/T385210) (owner: 10Kamila Součková) [14:26:24] (03Merged) 10jenkins-bot: benthos: update chart's modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115885 (https://phabricator.wikimedia.org/T385210) (owner: 10Kamila Součková) [14:26:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [14:27:23] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1120494|Rename the `tmpEnableMulLanguageCode` flag to `enableMulLanguageCode` (T330217)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:27:34] testing [14:28:58] lgtm [14:29:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:31:09] Anyone around? [14:31:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1033.eqiad.wmnet with reason: host reimage [14:31:41] yes? [14:32:11] I'm not sure if it's a good time to add another backport. [14:33:44] I'd like to ask to backport https://gerrit.wikimedia.org/r/1120561 to prevent broken translations on Hakka (hak) sites. [14:34:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1033.eqiad.wmnet with reason: host reimage [14:34:40] But I'm not sure whether it works for branches that are cut but not yet deployed. 
[14:34:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10559788 (10MoritzMuehlenhoff) [14:35:00] well, it’s deployed on testwiki [14:35:17] probably worth backporting, it’ll just take longer because of the i18n changes [14:35:28] but there’s nothing else in the window once my current backport is done [14:35:36] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120494|Rename the `tmpEnableMulLanguageCode` flag to `enableMulLanguageCode` (T330217)]] (duration: 12m 50s) [14:35:40] T330217: MUL - Cleanup soft rollout flag - https://phabricator.wikimedia.org/T330217 [14:35:42] (I was thinking of adding another config change but it’s not urgent) [14:35:53] Winston_Sung: want to do it now? are you still going to be around for the next half hour or so? [14:36:00] (wild guess at how long the l10n rebuild will take. might be even longer tbh) [14:36:00] Yes. [14:36:03] ok then let’s do it [14:36:08] Thanks. [14:36:23] Just added to the Deployment calendar. [14:36:33] `scap backport` complains that it’s WIP ^^ [14:36:37] (03PS2) 10Lucas Werkmeister (WMDE): i18n: Split hak.json system messages [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120561 (https://phabricator.wikimedia.org/T371883) (owner: 10Winston Sung) [14:36:41] * Lucas_WMDE marks as active [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:02] Oops. [14:37:09] Marked as ready for review. 
[14:37:14] RECOVERY - Disk space on grafana1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana1002&var-datasource=eqiad+prometheus/ops [14:37:30] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker2004.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:37:37] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:38:16] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:38:21] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:38:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [14:38:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1023.eqiad.wmnet [14:38:45] I’m just comparing with master… the master branch also has languages/i18n/datetime/hak-hans.json and languages/i18n/datetime/hak-hant.json which are missing on the wmf branch, is that okay? [14:39:32] Emm.. not required, but better to have. 
seemingly added in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1120405, at the same time as hak-latn.json which is part of the backport / cherry-pick [14:39:34] ok [14:39:56] then I’ll wait for the next patch set [14:41:35] (03PS3) 10Winston Sung: i18n: Split hak.json system messages [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120561 (https://phabricator.wikimedia.org/T371883) [14:42:21] ok, now the `git diff --stat master languages/` looks more sensible (only other l10n changes) [14:42:23] thanks! [14:42:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120561 (https://phabricator.wikimedia.org/T371883) (owner: 10Winston Sung) [14:43:11] (03CR) 10Winston Sung: "Cherry-picked `hak-hans.json` , `hak-hant.json` from https://gerrit.wikimedia.org/r/1120405 "Localisation updates from https://translatewi" [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120561 (https://phabricator.wikimedia.org/T371883) (owner: 10Winston Sung) [14:45:33] (03CR) 10Winston Sung: "(Refer to the `languages/i18n/datetime/` part.)" [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120561 (https://phabricator.wikimedia.org/T371883) (owner: 10Winston Sung) [14:45:50] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1023 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1120406 (owner: 10Muehlenhoff) [14:46:37] (03PS1) 10Ssingh: wikidough: update DoT TLS1.3 ciphers to match DoH [puppet] - 10https://gerrit.wikimedia.org/r/1120565 [14:47:39] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4949/co" [puppet] - 10https://gerrit.wikimedia.org/r/1120565 (owner: 10Ssingh) [14:53:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1033.eqiad.wmnet with OS bookworm [14:53:46] 
06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10559826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1033.eqiad.wmnet with OS bookworm completed: - ganeti103... [14:55:08] (03Merged) 10jenkins-bot: i18n: Split hak.json system messages [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120561 (https://phabricator.wikimedia.org/T371883) (owner: 10Winston Sung) [14:55:27] yay, it merged [14:55:39] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1120561|i18n: Split hak.json system messages (T371883)]] [14:55:42] T371883: Split Hakka (hak) translations - https://phabricator.wikimedia.org/T371883 [14:55:43] now to build the image, which I expect will still take quite some time [14:56:12] “4 languages rebuilt out of 534” [14:56:21] hak, hak-hans, hak-hant, hak-latn – checks out [14:57:19] “Finished build-and-push-container-images (duration: 01m 01s)” [14:57:19] huh! 
[14:57:30] jouncebot: nowandnext [14:57:31] For the next 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1400) [14:57:31] In 1 hour(s) and 2 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1600) [14:57:33] ok phew [14:57:39] I should’ve checked this first, really [14:59:35] (03CR) 10Ssingh: [V:03+1 C:03+2] wikidough: update DoT TLS1.3 ciphers to match DoH [puppet] - 10https://gerrit.wikimedia.org/r/1120565 (owner: 10Ssingh) [15:00:49] !log lucaswerkmeister-wmde@deploy2002 wsung, lucaswerkmeister-wmde: Backport for [[gerrit:1120561|i18n: Split hak.json system messages (T371883)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:00:53] T371883: Split Hakka (hak) translations - https://phabricator.wikimedia.org/T371883 [15:00:57] Winston_Sung: please test! [15:01:02] (that was much faster than I feared, yay) [15:01:13] !log cumin A:wikidough 'run-puppet-agent' [15:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:55] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply JDK 11 update - eevans@cumin1002 [15:04:06] Tested, all good for now. [15:04:11] !log lucaswerkmeister-wmde@deploy2002 wsung, lucaswerkmeister-wmde: Continuing with sync [15:04:13] nice, thanks! 
[15:04:22] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [15:04:53] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:12] ^ BGP alerts expected, will keep an eye out for the non-obvious ones [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:47] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120561|i18n: Split hak.json system messages (T371883)]] (duration: 15m 08s) [15:10:51] T371883: Split Hakka (hak) translations - https://phabricator.wikimedia.org/T371883 [15:10:52] \o/ [15:11:17] !log UTC afternoon backport+config window done [15:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:23] !log deploying mediawiki.org/beacon/event - don't raise error on failure [15:15:23] - T383939 [15:15:25] !log otto@deploy2002 Started scap sync-world: Backport for [[gerrit:1115111|mediawiki.org/beacon/event - don't raise error on failure (T383939 T353817)]] [15:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:26] T383939: PHP Unknown error: EventLoggingLegacyConverter: Failed proxying legacy EventLogging event query string to WMF Event Platform JSON: UnexpectedValueException: TemplateDataEditor is not in the list of allowed legacy schemas. 
- https://phabricator.wikimedia.org/T383939 [15:15:30] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [15:15:48] \o/ [15:17:24] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [15:17:59] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:20:06] !log otto@deploy2002 otto: Backport for [[gerrit:1115111|mediawiki.org/beacon/event - don't raise error on failure (T383939 T353817)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:20:34] !log otto@deploy2002 otto: Continuing with sync [15:27:38] !log otto@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115111|mediawiki.org/beacon/event - don't raise error on failure (T383939 T353817)]] (duration: 12m 12s) [15:27:42] T383939: PHP Unknown error: EventLoggingLegacyConverter: Failed proxying legacy EventLogging event query string to WMF Event Platform JSON: UnexpectedValueException: TemplateDataEditor is not in the list of allowed legacy schemas. 
- https://phabricator.wikimedia.org/T383939 [15:27:43] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [15:28:15] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:30:33] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14708MiB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [15:35:07] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:41:12] !log performing grafana failback (grafana1002 is becoming the new active host) T385282 [15:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:17] T385282: Disk space on grafana2001 is low - https://phabricator.wikimedia.org/T385282 [15:44:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3597 MB (3% inode=98%): /tmp 3597 MB (3% inode=98%): /var/tmp 3597 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [15:44:58] !log tappof@dns1004 START - running authdns-update [15:45:11] !log T386711 Ran mwscript-k8s --comment="T386711" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=nlwiktionary --logwiki=metawiki 'イム乙ノの' 'Renamed user 19841400c4049534bc11b1ec9a011fb8' [15:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:14] T386711: Unblock stuck global rename of Renamed_user_19841400c4049534bc11b1ec9a011fb8 - https://phabricator.wikimedia.org/T386711 [15:45:24] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:46:49] !log tappof@dns1004 END - running authdns-update 
[15:47:39] !log unarchive debs/dnsdist repository on Gerrit [15:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:44] !log tappof@dns1004 START - running authdns-update [15:51:42] !log tappof@dns1004 END - running authdns-update [16:00:05] jelto, arnoldokoth, and mutante: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1600) [16:00:25] FIRING: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:26] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; selector: name=wikikube-worker2002.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:00:32] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; selector: name=wikikube-worker2003.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:05:00] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: phab deploy [16:05:18] !log brennen@deploy2002 Started deploy [phabricator/deployment@c1262ac]: deploy phab2002 for T386522 [16:05:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:25] T386522: Deploy Phabricator/Phorge 2025-02-18 - https://phabricator.wikimedia.org/T386522 [16:05:32] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; selector: name=wikikube-worker2004.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:05:38] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; 
selector: name=wikikube-worker2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:05:47] !log brennen@deploy2002 Finished deploy [phabricator/deployment@c1262ac]: deploy phab2002 for T386522 (duration: 00m 28s) [16:05:52] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:05:59] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: phab deploy [16:06:14] !log brennen@deploy2002 Started deploy [phabricator/deployment@c1262ac]: deploy phab1004 for T386522 [16:07:31] !log brennen@deploy2002 Finished deploy [phabricator/deployment@c1262ac]: deploy phab1004 for T386522 (duration: 01m 17s) [16:07:40] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; selector: name=wikikube-worker100*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:09:02] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=20; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:12:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:14:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:19:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:39] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:29:48] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker200.*.eqiad.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:30:03] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker200.*.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:31:41] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply JDK 11 update - eevans@cumin1002 [16:41:42] I don't see my gerrit upload here as normal. Seems a bot is down. 
[16:43:37] * Lucas_WMDE peeks at wikibugs [16:43:55] thank you [16:44:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3500 MB (3% inode=98%): /tmp 3500 MB (3% inode=98%): /var/tmp 3500 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:45:03] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Apply JDK 11 update - eevans@cumin1002 [16:45:48] restarted, let’s see if it helps… [16:46:11] appreciate it, Lucas_WMDE [16:46:35] thank you for the tip btw [16:46:40] looks like we lost almost two hours of events :S [16:46:45] (judging by #wikimedia-dev) [16:46:49] (03CR) 10Aklapper: [V:03+2 C:03+2] E_STRICT PHP constant deprecated since PHP 8.4 [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120158 (owner: 10Aklapper) [16:47:11] yay [16:50:14] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10560225 (10Dzahn) 05Open→03In progress Alright. Just leaving it assigned to you for now then. 
[16:55:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:56:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [16:57:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:59:04] (03CR) 10Effie Mouzeli: [C:03+1] package_builder: add pbuilder hook for pcre2 component [puppet] - 10https://gerrit.wikimedia.org/r/1120587 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1700). [17:00:05] urbanecm and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:10] * urbanecm waves [17:00:11] o/ [17:00:53] urbanecm: thanks for going through the two-patch motions correctly on the first try <3 I'll put out your first one, let puppet run while I deploy dancy's, and then go ahead with the second phase if that sgty [17:01:06] sounds good :) [17:01:55] (03CR) 10RLazarus: [C:03+2] growthexperiments.pp: Mark unnecessary jobs as absent [puppet] - 10https://gerrit.wikimedia.org/r/1120556 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [17:03:22] o/ [17:04:40] dancy: hello hello -- I'm going to deploy these both at once, cool with you? [17:04:46] yes please! 
[17:04:50] (03CR) 10RLazarus: [C:03+2] logspam.pl: Add emacs mode line [puppet] - 10https://gerrit.wikimedia.org/r/1119201 (owner: 10Ahmon Dancy) [17:04:59] (03CR) 10RLazarus: [C:03+2] logspam.pl: Consolidate the "Failed to load data blob" exception [puppet] - 10https://gerrit.wikimedia.org/r/1119202 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy) [17:05:24] thanks for getting someone to review the perl so I didn't have to pretend I could <3 [17:05:56] haha.. no problem [17:06:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:07:08] puppet's running on mwlog1002 now, I'll let you know [17:07:08] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [17:07:36] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker200.*.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [17:07:57] and meanwhile back at the ranch, urbanecm: `rzl@mwmaint2002:~$ sudo systemctl list-units | grep fixLinkRecommendationData-dryrun-` comes back empty, anything else you want to check before I go ahead? [17:08:08] nope! 
[17:08:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:08:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:08:24] 👍 [17:08:30] (03CR) 10RLazarus: [C:03+2] growthexperiments.pp: Drop absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/1120557 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [17:08:45] (03PS2) 10Urbanecm: growthexperiments.pp: Drop absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/1120557 (https://phabricator.wikimedia.org/T385782) [17:09:40] dancy: okay have a look on mwlog1002 [17:10:21] rzl: Looks good. Thanks! [17:10:26] thank you! [17:10:57] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1120602 [17:11:14] (03CR) 10RLazarus: [C:03+2] growthexperiments.pp: Drop absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/1120557 (https://phabricator.wikimedia.org/T385782) (owner: 10Urbanecm) [17:14:06] all set -- thanks both for your patience with the lack of puppet windows last week during the SRE summit [17:16:58] (03PS1) 10Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) [17:19:39] 06SRE, 06Traffic: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10560361 (10Ottomata) @Fabfur {T383914} has been deployed, so it should be possible to remove the `meta.domain` field added in [[ https://gitlab.wikimedia.org/... 
[17:21:03] (03PS1) 10Ottomata: eventgate-analytics - bump to v1.11.0 for node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120604 (https://phabricator.wikimedia.org/T383814) [17:21:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:23:54] (03CR) 10Ottomata: [C:03+2] eventgate-analytics - bump to v1.11.0 for node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120604 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [17:24:21] (03PS1) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 [17:25:07] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:25:43] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:26:14] (03PS2) 10Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) [17:26:38] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [17:28:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [17:29:31] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [17:29:36] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10560401 (10Dzahn) I have sent an email to Arthur to verify the SSH key outside of this ticket. 
[17:30:09] !log upgrading eventgate-analytics in codfw to node20 (will let this simmer for a day before proceeding to eqiad) - T383814 [17:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:12] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [17:30:20] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [17:31:05] (03PS1) 10Ssingh: Release dnsdist 1.9.0-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 [17:36:33] !log LDAP/mwmaint1002: changed email address for LDAP user jonkolbert (T386473) [17:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:41] (03PS1) 10Bernard Wang: Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 [17:45:31] (03PS5) 10Herron: etcd: add etcd-backup-v3 script [puppet] - 10https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) [17:46:06] (03CR) 10Herron: "something to get the ball rolling on this, please let me know what you think 👍" [puppet] - 10https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) (owner: 10Herron) [17:55:52] (03PS1) 10Aklapper: Fix a typo in a description [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120613 [17:56:19] (03CR) 10Aklapper: [V:03+2 C:03+2] Fix a typo in a description [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120613 (owner: 10Aklapper) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1800) [18:04:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:04:42] (03PS1) 10Aklapper: Correct some option summaries about what they do 
[phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120616 [18:04:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3375 MB (3% inode=98%): /tmp 3375 MB (3% inode=98%): /var/tmp 3375 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [18:05:14] (03CR) 10Aklapper: [V:03+2 C:03+2] Correct some option summaries about what they do [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120616 (owner: 10Aklapper) [18:05:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:05:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:08:24] (03PS1) 10Michael Große: fix(Surfacing): make instrumentation platform-aware [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120618 (https://phabricator.wikimedia.org/T386490) [18:09:03] (03PS1) 10Aklapper: Correct indentation of some lines [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120619 [18:09:06] (03PS1) 10Michael Große: feat(Surfacing): track performance metrics with statslib [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120620 (https://phabricator.wikimedia.org/T386490) [18:09:41] (03PS2) 10Michael Große: feat(Surfacing): track performance metrics with statslib [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120620 (https://phabricator.wikimedia.org/T386490) [18:09:56] (03CR) 10Aklapper: [V:03+2 C:03+2] Correct indentation of some lines [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120619 (owner: 10Aklapper) [18:11:32] !log jhancock@cumin2002 START 
- Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:11:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:29] (03PS1) 10Aklapper: Add missing pht() to some options [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120621 [18:14:00] (03CR) 10Aklapper: [V:03+2 C:03+2] Add missing pht() to some options [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120621 (owner: 10Aklapper) [18:14:08] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply JDK 11 update - eevans@cumin1002 [18:15:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10560603 (10Ottomata) [18:15:43] (03PS2) 10Bernard Wang: Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) [18:16:47] (03CR) 10Scott French: "Thanks in advance for the review, Matthew. This is the component to which I'd propose we include the backport packages from apt-staging." [puppet] - 10https://gerrit.wikimedia.org/r/1120586 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [18:16:53] (03CR) 10RLazarus: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1120462 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [18:21:11] (03PS1) 10Aklapper: Add missing descriptions to options [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120622 [18:21:34] (03CR) 10Aklapper: [V:03+2 C:03+2] Add missing descriptions to options [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120622 (owner: 10Aklapper) [18:24:17] (03PS1) 10DCausse: Optimize CirrusSearch index update to trigger only when necessary [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120624 [18:24:20] (03PS2) 10Ssingh: Release dnsdist 1.9.8-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 [18:25:22] (03PS1) 10DCausse: Optimize CirrusSearch index update to trigger only when necessary [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120625 [18:25:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10560659 (10phaultfinder) [18:25:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120624 (owner: 10DCausse) [18:26:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120625 (owner: 10DCausse) [18:28:06] (03PS1) 10Daimona Eaytoy: [WIP] CampaignEvents enwiki + mswikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 [18:29:20] (03CR) 10CI reject: [V:04-1] [WIP] CampaignEvents enwiki + mswikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (owner: 10Daimona Eaytoy) [18:29:33] (03PS1) 
10JMeybohm: Update for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) [18:31:10] (03CR) 10CI reject: [V:04-1] Update for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [18:32:04] (03PS1) 10Esanders: Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) [18:32:12] (03CR) 10CI reject: [V:04-1] Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [18:34:49] (03PS2) 10Esanders: Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) [18:39:14] (03PS1) 10Daimona Eaytoy: Introduce config setting to disable default event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) [18:41:31] (03CR) 10Pppery: "Beware T275334" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [18:41:55] (03PS2) 10Daimona Eaytoy: enwiki, mswikt: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) [18:42:46] (03CR) 10CI reject: [V:04-1] enwiki, mswikt: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [18:45:18] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on 
cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10560719 (10VRiley-WMF) After looking into this, it seems it was a small glitch with the memory, h... [18:45:24] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10560720 (10VRiley-WMF) 05Open→03Resolved [18:47:52] (03CR) 10Ssingh: "This needs more work. I need to update debian/ as well so not ready for review." [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [18:55:39] (03PS1) 10Aklapper: Improve pht() log messages: Clarify logout vs disabling account [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120635 [18:56:10] (03CR) 10Aklapper: [V:03+2 C:03+2] Improve pht() log messages: Clarify logout vs disabling account [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120635 (owner: 10Aklapper) [18:59:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:59:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:00:05] dancy and andre: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T1900). 
[19:00:07] (03PS1) 10Aklapper: Fix variable name in last commit [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120636 [19:00:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:00:24] (03CR) 10Aklapper: [V:03+2 C:03+2] Fix variable name in last commit [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120636 (owner: 10Aklapper) [19:00:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:01:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:04:52] o/ [19:05:00] (03PS1) 10Aklapper: Log username instead of PHID via pht() [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120638 [19:05:25] (03CR) 10Aklapper: [V:03+2 C:03+2] Log username instead of PHID via pht() [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120638 (owner: 10Aklapper) [19:08:06] PROBLEM - Webrequests Varnishkafka log producer on cp5019 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:08:06] PROBLEM - statsv Varnishkafka log producer on cp5024 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:08:07] PROBLEM - Webrequests Varnishkafka log producer on cp5024 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:08:58] PROBLEM - Webrequests Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 0 processes with args 
/usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:08:58] PROBLEM - statsv Varnishkafka log producer on cp3066 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:08:59] PROBLEM - Webrequests Varnishkafka log producer on cp3066 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:01] PROBLEM - PyBal backends health check on lvs3010 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3069.esams.wmnet, cp3066.esams.wmnet, cp3068.esams.wmnet, cp3073.esams.wmnet, cp3067.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3069.esams.wmnet, cp3066.esams.wmnet, cp3068.esams.wmnet, cp3073.esams.wmnet, cp3067.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:09:01] PROBLEM - PyBal backends health check on lvs3008 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3069.esams.wmnet, cp3066.esams.wmnet, cp3068.esams.wmnet, cp3067.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3069.esams.wmnet, cp3068.esams.wmnet, cp3067.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:09:02] From my side, we are very down [19:09:04] PROBLEM - statsv Varnishkafka log producer on cp5023 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:04] PROBLEM - Webrequests Varnishkafka log producer on cp5020 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:05] PROBLEM - Webrequests Varnishkafka log 
producer on cp5023 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:07] PROBLEM - statsv Varnishkafka log producer on cp5020 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:08] PROBLEM - statsv Varnishkafka log producer on cp5019 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:11] #page [19:09:22] PROBLEM - Webrequests Varnishkafka log producer on cp3068 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:30] PROBLEM - statsv Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:30] PROBLEM - Webrequests Varnishkafka log producer on cp3073 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:31] PROBLEM - statsv Varnishkafka log producer on cp3068 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:33] PROBLEM - statsv Varnishkafka log producer on cp3069 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:34] PROBLEM - Webrequests Varnishkafka log producer on cp3067 is CRITICAL: PROCS CRITICAL: 0 processes 
with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:36] PROBLEM - statsv Varnishkafka log producer on cp3067 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:09:58] PROBLEM - statsv Varnishkafka log producer on cp3073 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:09:58] PROBLEM - Webrequests Varnishkafka log producer on cp3069 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:10:01] RECOVERY - PyBal backends health check on lvs3010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:10:01] RECOVERY - PyBal backends health check on lvs3008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:10:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:10:04] PROBLEM - statsv Varnishkafka log producer on cp5022 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:10:04] RECOVERY - Webrequests Varnishkafka log producer on cp5020 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S 
/etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:10:05] PROBLEM - Webrequests Varnishkafka log producer on cp5022 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:10:07] RECOVERY - statsv Varnishkafka log producer on cp5024 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:10:08] RECOVERY - Webrequests Varnishkafka log producer on cp5024 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:10:19] !incidents [19:10:19] 5682 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [19:10:30] !ack 5682 [19:10:31] 5682 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [19:11:06] RECOVERY - statsv Varnishkafka log producer on cp5022 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:11:06] RECOVERY - Webrequests Varnishkafka log producer on cp5022 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:11:10] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp5017 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:10] PROBLEM - Webrequests Varnishkafka log producer on cp5017 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:11:13] I'm seeing this https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&viewPanel=17 [19:11:41] https://grafana-rw.wikimedia.org/d/000000479/cdn-frontend-traffic?forceLogin=true&from=now-15m&orgId=1&to=now&viewPanel=13 appears to be recovering...? [19:12:06] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp5017 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:12:06] RECOVERY - Webrequests Varnishkafka log producer on cp5017 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:12:20] RECOVERY - Webrequests Varnishkafka log producer on cp3068 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:12:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2243.codfw.wmnet with OS bookworm [19:12:30] RECOVERY - statsv Varnishkafka log producer on cp3068 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:12:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10560906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2243.codfw.wmnet with OS bookworm [19:12:58] RECOVERY - Webrequests Varnishkafka log producer on cp3069 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:13:04] PROBLEM - Webrequests Varnishkafka log producer on cp5020 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:13:26] looking [19:13:30] RECOVERY - Webrequests Varnishkafka log producer on cp3067 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:13:30] RECOVERY - statsv Varnishkafka log producer on cp3069 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:13:31] RECOVERY - statsv Varnishkafka log producer on cp3067 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:14:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:15:04] RECOVERY - Webrequests Varnishkafka log producer on cp5019 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:15:04] RECOVERY - statsv Varnishkafka log producer on cp5019 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:16:48] 06SRE, 07Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10560915 (10Iniquity) [19:16:58] RECOVERY - statsv Varnishkafka log producer on cp3066 is OK: 
PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:16:58] RECOVERY - Webrequests Varnishkafka log producer on cp3066 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:20:09] 06SRE, 07Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10560934 (10Iniquity) {F58417922} https://www.wikimediastatus.net/ [19:23:17] Nothing for the record that I haven't rolled the train to group0 today and I'll wait until I see an all-clear. [19:23:21] *Noting ... [19:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10560969 (10phaultfinder) [19:25:06] RECOVERY - statsv Varnishkafka log producer on cp5023 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:25:06] RECOVERY - Webrequests Varnishkafka log producer on cp5023 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:26:42] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Apply JDK 11 update - eevans@cumin1002 [19:29:04] RECOVERY - Webrequests Varnishkafka log producer on cp5020 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:29:04] RECOVERY - statsv Varnishkafka log producer on cp5020 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:33:30] RECOVERY - Webrequests Varnishkafka log producer on cp3073 is OK: 
PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:33:58] RECOVERY - statsv Varnishkafka log producer on cp3073 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:34:58] RECOVERY - Webrequests Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:35:01] 06SRE, 07Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10561001 (10ssingh) We had a spike in requests and so this was "expected". Please let us know if you continue to see issues. [19:35:30] RECOVERY - statsv Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:36:34] (03PS1) 10Jdlrobson: Fix session tick logging [extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120641 (https://phabricator.wikimedia.org/T386229) [19:36:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2243.codfw.wmnet with reason: host reimage [19:40:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2243.codfw.wmnet with reason: host reimage [19:42:12] 06SRE, 07Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10561035 (10Iniquity) >>! In T386740#10561001, @ssingh wrote: > We had a spike in requests and so this was "expected". Please let us know if you continue to see issues. 
Everything is OK now :) [19:44:22] 06SRE, 07Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10561043 (10Iniquity) I want to know for the future, this is not the first time I have reported about "Service Unavailable". Should I do this in the future? [19:44:44] I see the magic words "Everything is OK now" so I'm proceeding with the train. [19:45:36] (03PS1) 10Michael Große: fix(surfacing): add dependency for link-icon in popup header [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120643 [19:46:30] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120644 (https://phabricator.wikimedia.org/T382368) [19:46:31] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120644 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [19:47:20] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120644 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [19:55:18] (03PS5) 10Krinkle: mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) [19:57:24] (03PS1) 10Michael Große: testwiki: enable surfacing structured task experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120646 (https://phabricator.wikimedia.org/T386739) [19:58:29] (03CR) 10DLynch: [C:03+1] Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [19:58:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [19:59:34] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.17 refs T382368 [19:59:39] T382368: 1.44.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T382368 [20:02:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:02:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:02:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2243.codfw.wmnet with OS bookworm [20:02:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10561120 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2243.codfw.wmnet with OS bookworm completed: - db2243 (**WARN**) - Remov... [20:03:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10561126 (10Jhancock.wm) 05Open→03Resolved [20:03:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10561129 (10Jhancock.wm) @Marostegui this is ready for you. 
[20:07:33] (03Restored) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (owner: 10Ahmon Dancy) [20:07:41] (03PS4) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (https://phabricator.wikimedia.org/T371633) [20:16:11] (03PS1) 10Aklapper: Further decrease number of queried last transactions (performance) [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120649 (https://phabricator.wikimedia.org/T386704) [20:17:09] (03CR) 10Aklapper: [V:03+2 C:03+2] Further decrease number of queried last transactions (performance) [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120649 (https://phabricator.wikimedia.org/T386704) (owner: 10Aklapper) [20:19:58] (03PS5) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (https://phabricator.wikimedia.org/T371633) [20:21:08] (03PS1) 10Aklapper: Decrease account disable threshold after logout [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120651 [20:21:24] (03CR) 10Aklapper: [V:03+2 C:03+2] Decrease account disable threshold after logout [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1120651 (owner: 10Aklapper) [20:22:22] (03CR) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (https://phabricator.wikimedia.org/T371633) (owner: 10Ahmon Dancy) [20:25:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10561191 (10phaultfinder) [20:27:35] (03PS1) 10Bking: opensearch-cirrus: add repository before attempting plugin install [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) [20:27:49] (03PS1) 10Scott French: dbctl: pass DbCtlConfiguration to DbConfig [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) [20:27:49] (03CR) 10Scott French: "Thanks in advance for the review, Riccardo!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [20:32:55] (03CR) 10Krinkle: "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1700. This patch is passing on puppet compiler for a" [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [20:33:50] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [20:34:54] (03CR) 10Daimona Eaytoy: "Ugh, thanks for the pointer. I think this should be fine for the time being, as the worst that can happen is (AIUI) the event-organizer gr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [20:35:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [20:36:40] (03PS3) 10Daimona Eaytoy: enwiki, mswikt: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) [20:38:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [20:40:49] 06SRE, 07Wikimedia-Incident: 503 Service Unavailable on all production - 
https://phabricator.wikimedia.org/T386740#10561252 (10ssingh) >>! In T386740#10561043, @Iniquity wrote: > I want to know for the future, this is not the first time I have reported about "Service Unavailable". Should I do this in the futur... [20:45:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [20:49:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) (owner: 10NMW03) [20:50:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) (owner: 10NMW03) [20:52:08] (03PS2) 10Bking: opensearch-cirrus: add repository before attempting plugin install [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) [20:52:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [20:54:27] 06SRE, 10SRE-Access-Requests: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754 (10MShilova_WMF) 03NEW [20:57:06] (03PS3) 10Bking: opensearch-cirrus: add repository before attempting plugin install [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) [20:59:21] (03PS4) 10Bking: opensearch-cirrus: add repository before attempting plugin install [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) [20:59:40] (03CR) 
10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T2100) [21:00:05] dcausse, kemayo, and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] o/ [21:00:57] o/ [21:03:41] o/ [21:04:39] o/ i can deploy [21:05:15] cjming: o/ thanks! [21:05:20] hi dcausse ! can your 2 go out together? [21:05:25] cjming: yes [21:06:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120624 (owner: 10DCausse) [21:06:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120625 (owner: 10DCausse) [21:06:41] note that they affect jobs so can't really be tested on mwdebug servers [21:07:03] sounds good - i'll go ahead and sync when ready [21:09:59] (03Merged) 10jenkins-bot: Optimize CirrusSearch index update to trigger only when necessary [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120624 (owner: 10DCausse) [21:10:17] (03Merged) 10jenkins-bot: Optimize CirrusSearch index update to trigger only when necessary [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120625 (owner: 10DCausse) [21:10:47] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1120624|Optimize CirrusSearch index update to trigger only when necessary]], [[gerrit:1120625|Optimize CirrusSearch index update to trigger only when necessary]] [21:11:26] 
(03PS3) 10Esanders: Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) [21:13:45] !log cjming@deploy2002 dcausse, cjming: Backport for [[gerrit:1120624|Optimize CirrusSearch index update to trigger only when necessary]], [[gerrit:1120625|Optimize CirrusSearch index update to trigger only when necessary]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:49] !log cjming@deploy2002 dcausse, cjming: Continuing with sync [21:20:23] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120624|Optimize CirrusSearch index update to trigger only when necessary]], [[gerrit:1120625|Optimize CirrusSearch index update to trigger only when necessary]] (duration: 09m 36s) [21:20:39] dcausse: should be live :) [21:20:42] cjming: thanks! :) [21:20:49] yw! [21:20:55] kemayo: still around? [21:21:04] cjming: Yup! [21:23:25] (03Merged) 10jenkins-bot: Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120630 (https://phabricator.wikimedia.org/T379102) (owner: 10Esanders) [21:23:57] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1120630|Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment (T379102)]] [21:24:00] T379102: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T379102 [21:26:54] !log cjming@deploy2002 esanders, cjming: Backport for [[gerrit:1120630|Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment (T379102)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:17] Kemayo: on test servers if testable [21:28:43] cjming: Good to go [21:28:50] cool - syncing [21:28:53] !log cjming@deploy2002 esanders, cjming: 
Continuing with sync [21:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10561372 (10phaultfinder) [21:35:28] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120630|Follow-up Iccb97796: Remove ru.wiki from DiscussionTools visual enhancements deployment (T379102)]] (duration: 11m 31s) [21:35:32] T379102: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T379102 [21:35:51] Kemayo: should be live :) [21:36:04] Nemoralis: still around? [21:36:38] yep [21:36:50] (03PS2) 10NMW03: Allow sysops to add/remove "confirmed" on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) [21:37:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) (owner: 10NMW03) [21:37:39] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Apply JDK 11 update - eevans@cumin1002 [21:38:13] (03Merged) 10jenkins-bot: Allow sysops to add/remove "confirmed" on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) (owner: 10NMW03) [21:38:40] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1119546|Allow sysops to add/remove "confirmed" on English Wikivoyage (T386313)]] [21:38:44] T386313: Allow sysops to add "confirmed" on enwikivoyage - https://phabricator.wikimedia.org/T386313 [21:40:40] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply JDK 11 update - eevans@cumin1002 [21:41:33] !log cjming@deploy2002 cjming, nmw03: Backport for [[gerrit:1119546|Allow sysops to add/remove "confirmed" on English Wikivoyage (T386313)]] synced 
to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:41:48] lgtm [21:41:50] Nemoralis: 1st patch up on test servers [21:41:55] oh good - then syncing [21:41:58] !log cjming@deploy2002 cjming, nmw03: Continuing with sync [21:48:29] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119546|Allow sysops to add/remove "confirmed" on English Wikivoyage (T386313)]] (duration: 09m 48s) [21:48:32] T386313: Allow sysops to add "confirmed" on enwikivoyage - https://phabricator.wikimedia.org/T386313 [21:48:46] (03PS2) 10NMW03: Add "suppressredirect" to "editor" on Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) [21:49:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) (owner: 10NMW03) [21:49:44] (03Merged) 10jenkins-bot: Add "suppressredirect" to "editor" on Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) (owner: 10NMW03) [21:50:14] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1119548|Add "suppressredirect" to "editor" on Russian Wikisource (T386367)]] [21:50:17] T386367: Russian Wikisource needs ''suppressredirect'' right for ''editor'' group - https://phabricator.wikimedia.org/T386367 [21:53:04] Nemoralis: your 2nd patch up on test servers [21:53:05] !log cjming@deploy2002 nmw03, cjming: Backport for [[gerrit:1119548|Add "suppressredirect" to "editor" on Russian Wikisource (T386367)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:53:22] lgtm [21:53:28] !log cjming@deploy2002 nmw03, cjming: Continuing with sync [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250218T2200) [22:00:05] !log cjming@deploy2002 Finished scap 
sync-world: Backport for [[gerrit:1119548|Add "suppressredirect" to "editor" on Russian Wikisource (T386367)]] (duration: 09m 51s) [22:00:09] T386367: Russian Wikisource needs ''suppressredirect'' right for ''editor'' group - https://phabricator.wikimedia.org/T386367 [22:00:19] thanks [22:07:13] hey cjming, let me know when you're done with backport window. [22:07:43] jan_drewniak: done! all yours [22:07:52] cjming: ok thanks! [22:07:58] Nemoralis: 2nd patch should be live :) [22:08:17] yes, i know :D thanks [22:10:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120641 (https://phabricator.wikimedia.org/T386229) (owner: 10Jdlrobson) [22:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:48] (03CR) 10Bking: [V:03+2 C:03+2] "Self-merging, as this will not affect anything other than non-production relforge hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1120654 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [22:19:36] (03Merged) 10jenkins-bot: Fix session tick logging [extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120641 (https://phabricator.wikimedia.org/T386229) (owner: 10Jdlrobson) [22:20:07] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1120641|Fix session tick logging (T386229)]] [22:20:11] T386229: No events being logged to product_metrics.web_base.search_ab_test_session_ticks - https://phabricator.wikimedia.org/T386229 [22:21:37] (03PS1) 10TChin: Eventstreams: Bump image to v0.15.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120674 (https://phabricator.wikimedia.org/T386750) [22:23:05] !log jdrewniak@deploy2002 
jdrewniak, jdlrobson: Backport for [[gerrit:1120641|Fix session tick logging (T386229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10561558 (10phaultfinder) [22:24:41] !log jdrewniak@deploy2002 jdrewniak, jdlrobson: Continuing with sync [22:28:15] (03CR) 10Ottomata: [C:03+1] Eventstreams: Bump image to v0.15.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120674 (https://phabricator.wikimedia.org/T386750) (owner: 10TChin) [22:31:21] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120641|Fix session tick logging (T386229)]] (duration: 11m 13s) [22:31:24] T386229: No events being logged to product_metrics.web_base.search_ab_test_session_ticks - https://phabricator.wikimedia.org/T386229 [22:38:03] (03CR) 10TChin: [C:03+2] Eventstreams: Bump image to v0.15.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120674 (https://phabricator.wikimedia.org/T386750) (owner: 10TChin) [22:39:41] (03Merged) 10jenkins-bot: Eventstreams: Bump image to v0.15.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120674 (https://phabricator.wikimedia.org/T386750) (owner: 10TChin) [22:43:00] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [22:43:30] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [22:45:02] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [22:45:42] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [22:46:03] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [22:46:51] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [22:47:47] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply 
[22:47:51] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [22:48:09] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [22:48:40] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [22:49:49] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [22:50:31] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [22:50:45] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [22:51:32] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [23:24:03] (03PS1) 10Arlolra: Turn on Parsoid Read Views for 31 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386272) [23:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10561765 (10phaultfinder) [23:25:09] (03PS2) 10Arlolra: Turn on Parsoid Read Views for 31 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) [23:43:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:47:25] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5020.eqsin.wmnet,service=(cdn|ats-be) [23:48:15] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Apply JDK 11 update - eevans@cumin1002 [23:51:13] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5020.eqsin.wmnet,service=(cdn|ats-be) [reason: repooling; resolved service errors]