[00:00:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:46] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635198 (phaultfinder)
[00:24:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635239 (phaultfinder)
[00:25:13] (CR) Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: Andrea Denisse)
[00:38:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1127693
[00:38:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1127693 (owner: TrainBranchBot)
[00:39:49] (CR) Bartosz Dziewoński: [C:+1] Fix some SUL3 shared domain settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T375796) (owner: Gergő Tisza)
[00:50:42] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1127693 (owner: TrainBranchBot)
[01:05:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635264 (phaultfinder)
[01:09:21] (PS1) TrainBranchBot: Branch commit for
wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1127696
[01:09:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1127696 (owner: TrainBranchBot)
[01:10:24] (CR) Ssingh: "Looks good, no major blockers. I will run some dry-runs tomorrow to cement my own understanding." [cookbooks] - https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[01:14:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudelastic1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:20:05] (CR) Bartosz Dziewoński: [C:+1] "This resolves T388218, right?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T375796) (owner: Gergő Tisza)
[01:29:22] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1127696 (owner: TrainBranchBot)
[01:48:01] (CR) RLazarus: [C:+1] "Thanks for this!"
[puppet] - https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) (owner: Scott French)
[02:19:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635274 (phaultfinder)
[02:54:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635280 (phaultfinder)
[04:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:27:16] (PS11) Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553)
[04:27:17] PROBLEM - Restbase root url on restbase2024 is CRITICAL: connect to address 10.192.16.23 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[05:00:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:01:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:51] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127760
[05:02:27] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:03:27] RECOVERY - Router interfaces on cr1-codfw is
OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:03:34] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127763
[05:03:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudelastic1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635317 (phaultfinder)
[05:42:53] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127791
[05:49:33] ops-eqiad, DBA, DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10635321 (Marostegui) p:Triage→Medium #Dc-ops can we reach out to dell about this crash with the above logs? Thanks!
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250314T0600)
[06:09:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635327 (phaultfinder)
[06:10:31] (PS1) Marostegui: db1248: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1127792 (https://phabricator.wikimedia.org/T388837)
[06:10:58] (CR) Marostegui: [C:+2] db1248: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1127792 (https://phabricator.wikimedia.org/T388837) (owner: Marostegui)
[06:27:54] ops-eqiad, DBA, DC-Ops, Patch-For-Review: db1248 crash - https://phabricator.wikimedia.org/T388837#10635337 (Marostegui) I've started mariadb
[06:30:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:39:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635345 (phaultfinder)
[06:40:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:45:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250314T0700)
[07:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:58] (PS1) Brouberol: mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282)
[07:31:11] (CR) CI reject: [V:-1] mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[07:33:44] (PS2) Brouberol: mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282)
[08:22:18] (PS9) Vgutierrez: sre.loadbalancer: upgrade/restart cookbook for liberica [cookbooks] - https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369)
[08:23:10] (CR)
Vgutierrez: sre.loadbalancer: upgrade/restart cookbook for liberica (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[08:26:26] (CR) Vgutierrez: [C:+2] cumin: Update (liberica|lvs)-drmrs aliases [puppet] - https://gerrit.wikimedia.org/r/1127471 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[08:28:33] (CR) Volans: "Thanks for the fix, it looks much cleaner now. I'll leave it to observability as you know better what type of data comes from ldap." [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: Andrea Denisse)
[08:45:18] (CR) Elukey: [C:+1] Temporary revert changeprop/changeprop-jobqueue to node 18 images [deployment-charts] - https://gerrit.wikimedia.org/r/1127166 (owner: Aaron Schulz)
[08:49:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635481 (phaultfinder)
[08:49:57] (PS1) Ilias Sarantopoulos: ml-services: enable multiprocessing for ptwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1127851
[09:00:42] (PS16) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[09:04:47] (CR) Elukey: [C:+2] Temporary revert changeprop/changeprop-jobqueue to node 18 images [deployment-charts] - https://gerrit.wikimedia.org/r/1127166 (owner: Aaron Schulz)
[09:05:58] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:13:50] (CR) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey)
[09:14:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on
cloudelastic1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:15:04] (PS1) Vgutierrez: site,hiera: Reimage lvs3010 as liberica [puppet] - https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477)
[09:15:47] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[09:16:03] (CR) Vgutierrez: [C:-2] "to be merged on 2025-03-17" [puppet] - https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[09:18:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:19:34] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm
[09:19:45] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10635577 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host db1257.eqiad.wmnet with OS bookworm
[09:20:39] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10635579 (elukey) I've ran the new/testing version of the provision cookbook for Supermicro, everything worked! Hope to release it very soon so we get rid of these issues.
[09:21:42] (PS3) Arnaudb: nftables: add a newline at the end of GERRIT_ABUSERS_ipv4 [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783)
[09:21:42] (CR) Arnaudb: "I manually edited the file and restarted nftables with no success" [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[09:22:52] (CR) Arnaudb: "with no problem*" [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[09:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635586 (phaultfinder)
[09:30:31] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[09:30:56] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[09:31:20] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[09:31:29] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[09:36:37] !log set 400G retention for udp_localhost-err topic in kafka-logging eqiad
[09:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:15] !log set 1TB retention for udp_localhost-warning topic in kafka-logging eqiad
[09:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:33] (PS1) DCausse: kartotherian: use wdqs-internal-main [deployment-charts] - https://gerrit.wikimedia.org/r/1127856
[09:52:22] (CR) Btullis: mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:56:36] (PS2) DCausse: kartotherian: use wdqs-internal-main [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860)
[09:56:55] (PS3) Brouberol: mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282)
[09:59:17] (CR) Elukey: "Hey David! I'll deploy the change next week, it LGTM but I think we could change the helmfile's value.yaml as well to change the discovery" [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[09:59:50] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10635636 (MatthewVernon) @VRiley-WMF thanks for the update! We used to keep spare disks for ms-be* nodes in both eqiad (T331987) and codfw (T331988). Have these all been used up n...
[10:02:47] (CR) Brouberol: mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[10:03:36] (CR) DCausse: "oh good point, completely missed that!
fixed" [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[10:03:36] (PS3) DCausse: kartotherian: use wdqs-internal-main [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860)
[10:07:19] (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1127857 (https://phabricator.wikimedia.org/T385970)
[10:08:18] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1257.eqiad.wmnet with OS bookworm
[10:08:24] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10635645 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host db1257.eqiad.wmnet with OS bookworm executed with errors: - db1257 (**FAIL**)...
[10:09:11] (CR) Btullis: [C:+1] mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[10:09:22] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas [deployment-charts] - https://gerrit.wikimedia.org/r/1127800 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[10:09:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635647 (phaultfinder)
[10:09:49] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:10:57] (PS1) Hnowlan: mw-(web|api-ext): scale up in anticipation of switchover [deployment-charts] - https://gerrit.wikimedia.org/r/1127859 (https://phabricator.wikimedia.org/T385155)
[10:10:59] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision
(exit_code=99) for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:16:27] (PS17) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[10:17:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:19:15] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:20:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635662 (phaultfinder)
[10:20:43] (CR) Volans: "Nice addition! Looks pretty good. I've left some comments inline, mostly optional or suggestions to be a bit more DRY." [cookbooks] - https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[10:22:28] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10635668 (elukey) >>! In T384979#10635579, @elukey wrote: > I've ran the new/testing version of the provision cookbook for Supermicro, everything worked! Hope to release it very soon...
[10:27:41] (CR) Elukey: kartotherian: use wdqs-internal-main (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[10:27:47] (PS1) Brouberol: site: restore role to cloudelastic1011 [puppet] - https://gerrit.wikimedia.org/r/1127863
[10:29:02] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1127863 (owner: Brouberol)
[10:29:24] (CR) Brouberol: [C:+2] site: restore role to cloudelastic1011 [puppet] - https://gerrit.wikimedia.org/r/1127863 (owner: Brouberol)
[10:32:54] (PS4) DCausse: kartotherian: use wdqs-internal-main [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860)
[10:32:57] (CR) DCausse: kartotherian: use wdqs-internal-main (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[10:33:16] (PS2) Volans: sre.deploy: use new run_cookbook features [cookbooks] - https://gerrit.wikimedia.org/r/1124399
[10:33:43] (CR) Volans: [C:+2] sre.deploy: use new run_cookbook features [cookbooks] - https://gerrit.wikimedia.org/r/1124399 (owner: Volans)
[10:37:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[10:37:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[10:39:48] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey)
[10:40:41] (Merged) jenkins-bot: sre.deploy: use new run_cookbook features [cookbooks] - https://gerrit.wikimedia.org/r/1124399 (owner: Volans)
[10:40:58] (CR) Elukey: [C:+1] tests: remove unnecessary vulture setting [software/spicerack] - https://gerrit.wikimedia.org/r/1125956 (owner: Volans)
[10:46:02] (CR) Elukey: "Hi!
The change makes sense and LGTM, but given the outage that it caused I'd like to seek consensus from more httpd-config-experts before " [puppet] - https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: Simon04)
[10:46:37] (CR) Lucas Werkmeister (WMDE): "Thanks! As a deployer, I definitely assumed in the past that something like this was already in place 😬 so it’s great to have it now 🎉" [mediawiki-config] - https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: Hashar)
[10:47:43] (CR) Elukey: [C:+1] tests: remove unnecessary vulture setting [software/pywmflib] - https://gerrit.wikimedia.org/r/1125954 (owner: Volans)
[10:48:42] (PS1) Volans: CHANGELOG: add changelogs for release v5.1.1 [software/cumin] - https://gerrit.wikimedia.org/r/1127872
[10:48:55] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v5.1.1 [software/cumin] - https://gerrit.wikimedia.org/r/1127872 (owner: Volans)
[10:49:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635688 (phaultfinder)
[10:55:30] (CR) Elukey: [C:+1] "I left a comment for a use case of timer.cancel(), but I am probably missing a corner case so feel free to proceed in case :)" [software/pywmflib] - https://gerrit.wikimedia.org/r/1125953 (owner: Volans)
[10:57:21] !log set 150GB (per 6x partition = ~1TB) retention for udp_localhost-warning topic in kafka-logging eqiad
[10:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:05] !log set 80GB (per 6x partition ~500GB) retention for udp_localhost-err topic in kafka-logging eqiad
[10:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:46] (CR) Elukey: [C:+1] "Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[10:59:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudelastic1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250314T0700)
[11:00:05] jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250314T1100).
[11:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:30] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v5.1.1 [software/cumin] - https://gerrit.wikimedia.org/r/1127872 (owner: Volans)
[11:08:26] (Abandoned) Hashar: gerrit: ban bad crawler [puppet] - https://gerrit.wikimedia.org/r/1127520 (owner: Hashar)
[11:09:50] (PS1) Volans: Upstream release v5.1.1 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/1127874
[11:10:16] (CR) Volans: [C:+2] Upstream release v5.1.1 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/1127874 (owner: Volans)
[11:13:20] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 9 hosts with reason: Adding the hosts to the analytics hadoop cluster in batches.
this is part of the next batch
[11:13:58] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1199.eqiad.wmnet with reason: Adding the hosts to the analytics hadoop cluster in batches. this is part of the next batch
[11:19:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635733 (phaultfinder)
[11:21:48] (CR) Volans: "reply and new question inline" [software/pywmflib] - https://gerrit.wikimedia.org/r/1125953 (owner: Volans)
[11:24:28] (Merged) jenkins-bot: Upstream release v5.1.1 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/1127874 (owner: Volans)
[11:36:52] !log uploaded cumin_5.1.1 to apt.wikimedia.org bullseye-wikimedia
[11:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:11] (PS1) Hashar: gerrit: group similare prefixes under gerrit_abusers [puppet] - https://gerrit.wikimedia.org/r/1127875
[11:39:11] (PS1) Hashar: gerrit: move ByteDance blocks from Apache to firewall [puppet] - https://gerrit.wikimedia.org/r/1127876 (https://phabricator.wikimedia.org/T375996)
[11:40:08] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw
[11:40:24] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) for datacenter switchover from eqiad to codfw
[11:49:14] (CR) Elukey: [C:+2] sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey)
[11:49:35] (PS1) Hnowlan: switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155)
[11:49:41] (CR) CI reject: [V:-1] switchdc: delete Job objects for mw-cron due to library support [cookbooks] -
https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[11:50:01] (PS2) Hnowlan: switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155)
[11:52:40] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw
[11:52:57] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw
[11:59:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635915 (phaultfinder)
[12:00:31] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from eqiad to codfw
[12:00:46] live test ^
[12:02:51] !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[12:02:53] !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[12:02:55] !log root@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[12:02:57] !log root@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[12:03:03] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw
[12:06:15] (PS1) Btullis: mediawiki: Update the dumps job template to support write access [deployment-charts] - https://gerrit.wikimedia.org/r/1127879 (https://phabricator.wikimedia.org/T352650)
[12:31:00] (CR) CI reject: [V:-1] switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[12:40:06] (PS5) Hnowlan: switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878
(https://phabricator.wikimedia.org/T385155)
[12:45:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635985 (phaultfinder)
[12:46:33] (CR) CI reject: [V:-1] switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[12:49:24] (PS1) Clément Goubert: mediawiki: Change kafka topic for rsyslog [deployment-charts] - https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335)
[12:57:21] RECOVERY - Restbase root url on restbase2024 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 4.914 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:04:52] (PS6) Hnowlan: switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155)
[13:07:22] (PS1) Hashar: Remove obsolete $wgMediaInfoMediaSearchHasLtrPlugin [mediawiki-config] - https://gerrit.wikimedia.org/r/1127886 (https://phabricator.wikimedia.org/T297863)
[13:09:07] (CR) Hashar: "This is part of removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - https://gerrit.wikimedia.org/r/1127886 (https://phabricator.wikimedia.org/T297863) (owner: Hashar)
[13:11:25] (PS1) Hashar: Remove obsolete $wgMinervaApplyKnownTemplateHacks [mediawiki-config] - https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T362727)
[13:13:44] !log installed cumin v5.1.1 on cloudcumin* and cuminunpriv* hosts
[13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:04] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudgw1003.eqiad.wmnet
[13:15:26] (CR) Brouberol: [C:+1] "We tested the change by injecting the security context dynamically via the
airflow task in charge of creating the pod, and it worked." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127879 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [13:15:35] (03CR) 10AikoChou: [C:03+1] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127857 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [13:16:02] (03PS1) 10Hashar: beta: remove obsolete $wgMwEmbedModuleConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127888 (https://phabricator.wikimedia.org/T100106) [13:19:06] (03PS1) 10Hashar: Remove obsolete $wgNoticeFundraisingUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127889 [13:21:27] (03PS1) 10Hashar: Remove obsolete $wgNoticeReporterDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127890 (https://phabricator.wikimedia.org/T232912) [13:24:11] (03CR) 10Hashar: "I'd self merge it since that is solely for Beta, but who knows!?! :b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127888 (https://phabricator.wikimedia.org/T100106) (owner: 10Hashar) [13:24:39] (03CR) 10Hashar: "This is part of removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T362727) (owner: 10Hashar) [13:24:43] (03CR) 10Hashar: "This is part of removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127889 (owner: 10Hashar) [13:24:46] (03CR) 10Hashar: "This is part of removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127890 (https://phabricator.wikimedia.org/T232912) (owner: 10Hashar) [13:24:52] (03CR) 10Jforrester: "Yeah, this is fine (but remember not to deploy on a Friday, as this will show up 
next scap)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127888 (https://phabricator.wikimedia.org/T100106) (owner: 10Hashar) [13:27:01] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127857 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [13:27:49] (03PS1) 10Hashar: Remove obsolete $wgParserCacheNewKeySchemaRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127894 (https://phabricator.wikimedia.org/T373037) [13:28:28] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127857 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [13:33:12] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:36:54] (03CR) 10Ladsgroup: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127894 (https://phabricator.wikimedia.org/T373037) (owner: 10Hashar) [13:37:26] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:37] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:13] (03PS1) 10Hashar: Remove obsolete $wgPopupsEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T267211) [13:43:14] (03CR) 10Hashar: "This is part of 
removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T267211) (owner: 10Hashar) [13:43:44] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10636220 (10MatthewVernon) >>! In T378922#10623769, @Jelto wrote: > @MatthewVernon, what do you need to create credentials for the... [13:47:09] (03PS1) 10Hashar: Remove obsolete $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 (https://phabricator.wikimedia.org/T364347) [13:48:41] (03CR) 10Hashar: "This is part of removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 (https://phabricator.wikimedia.org/T364347) (owner: 10Hashar) [13:49:30] (03CR) 10Jforrester: [C:03+1] Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar) [13:51:56] (03CR) 10Volans: [C:03+2] tests: remove unnecessary vulture setting [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125956 (owner: 10Volans) [13:52:12] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10636237 (10MatthewVernon) >>! In T378922#10624358, @jcrespo wrote: >>>! In T378922#10623769, @Jelto wrote: >>> So that sounds prom... 
[13:52:23] (03PS1) 10Hashar: Remove obsolete $wgRelatedArticlesLoggingBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127900 (https://phabricator.wikimedia.org/T202306) [13:52:40] (03CR) 10Ssingh: site,hiera: Reimage lvs3010 as liberica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:53:24] (03CR) 10Hashar: "This is part of removing unused configuration settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127900 (https://phabricator.wikimedia.org/T202306) (owner: 10Hashar) [13:54:19] (03PS2) 10Vgutierrez: site,hiera: Reimage lvs3010 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) [13:54:27] (03CR) 10Vgutierrez: site,hiera: Reimage lvs3010 as liberica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:55:48] (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs3010 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:56:02] (03CR) 10Jforrester: "Yeah, let's abandon." [skins/Vector] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126550 (https://phabricator.wikimedia.org/T388475) (owner: 10Jforrester) [13:56:05] (03Abandoned) 10Jforrester: Fix missing parens in TableOfContents.less [skins/Vector] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126550 (https://phabricator.wikimedia.org/T388475) (owner: 10Jforrester) [13:56:16] (03CR) 10Hashar: "Thank you for the clarification! 
😎" [puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [13:58:48] (03PS1) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127902 (https://phabricator.wikimedia.org/T388388) [13:59:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:12] (03CR) 10Ssingh: sre.loadbalancer: upgrade/restart cookbook for liberica (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [14:00:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 1.895 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:19] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:04:08] (03Merged) 10jenkins-bot: tests: remove unnecessary vulture setting [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125956 (owner: 10Volans) [14:04:52] (03PS6) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [14:10:04] (03PS1) 10Vgutierrez: liberica,hiera: Add IPv6 endpoints for prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) [14:10:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) (owner: 10Vgutierrez) [14:14:19] (03PS2) 10Vgutierrez: liberica,hiera: Add IPv6 endpoints for prometheus metrics [puppet] - 
10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) [14:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10636267 (10phaultfinder) [14:15:38] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) (owner: 10Vgutierrez) [14:19:50] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [14:20:21] (03CR) 10JMeybohm: shellbox-video: use the correct helm version in each cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [14:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10636280 (10phaultfinder) [14:29:16] (03CR) 10Ssingh: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) (owner: 10Vgutierrez) [14:44:33] (03CR) 10Bernard Wang: [C:03+1] Remove obsolete $wgMinervaApplyKnownTemplateHacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T362727) (owner: 10Hashar) [14:49:16] (03PS10) 10Vgutierrez: sre.loadbalancer: upgrade/restart cookbook for liberica [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) [14:49:23] (03CR) 10Vgutierrez: sre.loadbalancer: upgrade/restart cookbook for liberica (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [14:53:33] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886 (10RobH) 03NEW [14:54:11] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10636407 (10RobH) [14:54:12] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10636408 (10ssingh) a:03RobH [14:55:23] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887 (10RobH) 03NEW [14:55:39] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10636427 (10RobH) [14:55:40] !log kafka-logging reduce mediawiki.httpd.accesslog topic retention from 172800000ms (2d) to 129600000ms (1.5d) [14:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:06] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10636429 (10RobH) p:05Medium→03High I'll open a task for this later today and get parts sent. A bus error is typically an issue on the mainboard but we'll find out! 
[14:57:32] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10636443 (10ssingh) Thanks Rob! Host is depooled so can be worked on any time. [15:01:25] (03CR) 10Cwhite: grafana: Normalize user fields and validate input in LDAP sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [15:03:02] (03PS7) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [15:03:43] RECOVERY - Disk space on kafka-logging1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops [15:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:55] (03PS1) 10Btullis: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) [15:13:35] (03CR) 10Phuedx: [C:03+1] Remove obsolete $wgPopupsEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T267211) (owner: 10Hashar) [15:14:22] (03CR) 10Phuedx: [C:03+1] Remove obsolete $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 
(https://phabricator.wikimedia.org/T364347) (owner: 10Hashar) [15:15:28] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127902 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [15:18:59] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [15:19:02] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [15:19:17] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [15:19:33] (03CR) 10Nik Gkountas: [C:03+1] AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [15:19:39] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [15:20:34] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [15:20:40] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [15:20:43] (03CR) 10JMeybohm: [C:03+2] k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127902 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [15:20:48] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [15:20:50] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [15:27:26] (03PS8) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [15:28:43] (03CR) 10Volans: [C:03+1] "Great! 
LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [15:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10636549 (10phaultfinder) [15:30:42] (03PS1) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127920 [15:33:19] (03CR) 10JMeybohm: [C:03+2] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127920 (owner: 10JMeybohm) [15:34:54] (03PS1) 10JMeybohm: aptrepo: Add bullseye component kubernetes131 [puppet] - 10https://gerrit.wikimedia.org/r/1127921 (https://phabricator.wikimedia.org/T341984) [15:36:17] (03PS1) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127922 (https://phabricator.wikimedia.org/T388388) [15:36:27] (03CR) 10JMeybohm: [C:03+2] aptrepo: Add bullseye component kubernetes131 [puppet] - 10https://gerrit.wikimedia.org/r/1127921 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [15:38:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10636571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be20... 
[15:39:25] (03PS1) 10Slyngshede: data.yaml offboarding rook [puppet] - 10https://gerrit.wikimedia.org/r/1127923 [15:41:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [15:41:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10636581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [15:41:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [15:41:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10636582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be20... [15:42:51] (03PS2) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127922 (https://phabricator.wikimedia.org/T388388) [15:43:07] (03CR) 10JHathaway: [C:03+1] data.yaml offboarding rook [puppet] - 10https://gerrit.wikimedia.org/r/1127923 (owner: 10Slyngshede) [15:44:24] (03CR) 10Slyngshede: [C:03+2] data.yaml offboarding rook [puppet] - 10https://gerrit.wikimedia.org/r/1127923 (owner: 10Slyngshede) [15:44:39] (03PS3) 10Volans: interactive: notify when waiting for input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 [15:44:39] (03PS3) 10Volans: tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 [15:44:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10636593 (10Jhancock.wm) so fun update. after i replaced the drives i tried to reimage. it kept failing at partitioning the os drive. 
got papaul involved. found that the 1st os d... [15:47:11] (03CR) 10JMeybohm: [C:03+2] k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127922 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [15:48:47] (03PS4) 10Volans: interactive: notify when waiting for input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 [15:48:47] (03PS4) 10Volans: tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 [15:49:01] (03CR) 10Volans: interactive: notify when waiting for input (033 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [15:50:56] !log slyngshede@cumin1002 DONE (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Vivian Rook out of all services on: 2288 hosts [15:51:42] (03PS2) 10Vgutierrez: lists: Offer RSA+ECDSA certificates on lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1127066 (https://phabricator.wikimedia.org/T385067) [15:53:19] (03CR) 10JHathaway: [C:03+1] lists: Offer RSA+ECDSA certificates on lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1127066 (https://phabricator.wikimedia.org/T385067) (owner: 10Vgutierrez) [15:54:30] (03PS1) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127924 [15:58:47] (03CR) 10JMeybohm: [C:03+2] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127924 (owner: 10JMeybohm) [16:00:27] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [16:00:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [16:00:55] 🍿 [16:00:58] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [16:01:09] put extra butter [16:01:13] Kerrygold ideally [16:01:14] !log sukhe@cumin1002 END (FAIL) - Cookbook 
sre.loadbalancer.admin (exit_code=1) pooling A:liberica-canary [16:01:27] uh.. it crashed on the repool [16:01:33] yeah [16:01:37] Mismatched message, not enabling puppet. [16:01:47] but that's because Puppet was already disabled [16:01:50] The last Puppet run was at Tue Mar 11 15:39:32 UTC 2025 (4342 minutes ago). Puppet is disabled. vgutierrez - vgutierrez [16:01:51] The last Puppet run was at Tue Mar 11 15:39:32 UTC 2025 (4341 minutes ago). Puppet is disabled. vgutierrez - vgutierrez [16:01:51] LOL [16:01:55] * vgutierrez hides [16:02:16] that probably answers the other question I had, related to the upgrade one [16:02:19] can I enable it? [16:02:23] sukhe: yes [16:02:46] (03CR) 10Dzahn: [C:03+1] nftables: add a newline at the end of GERRIT_ABUSERS_ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [16:03:30] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [16:03:38] (03CR) 10Kamila Součková: "CHECK IT OUT IT'S GREEN +1 IT BEFORE ANYONE NOTICES THE MESS! :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [16:04:12] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) pooling A:liberica-canary [16:04:40] ^ that's fine though, it was already pooled. doing one more round [16:04:48] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [16:04:56] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [16:05:07] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [16:05:23] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) pooling A:liberica-canary [16:05:41] ok yeah [16:05:47] so what I messed up? 
[16:05:56] not you :) we all did [16:06:00] > depool lvs1013.eqiad.wmnet: testing depool [16:06:05] mismatched puppet message [16:06:39] so I can't use the reason as the puppet message [16:07:06] reason = f"{self._args.action} {hosts}: {self._args.reason}" [16:07:26] yeah.. that's gonna be pool|depool [16:07:31] it will never match [16:07:34] nice catch :D [16:07:47] so we missed this in the review [16:09:05] don't forget to re-enable puppet there [16:09:05] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10636722 (10Jhancock.wm) yeah i can do that. I'll need some downtime for reboots. what would be the best time for that next week? shouldn't need more than 2 hours [16:09:14] right, doing it [16:14:26] !log slyngshede@cumin1002 DONE (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Vivian Rook out of all services on: 2288 hosts [16:15:09] (03PS2) 10Gergő Tisza: Fix some SUL3 shared domain settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) [16:15:32] (03CR) 10Gergő Tisza: "Oh, right, wrong task ID." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [16:15:44] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Vivian Rook out of all services on: 2288 hosts [16:15:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [16:15:59] (03PS9) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [16:20:46] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [16:22:05] (03PS2) 10Jdlrobson: Remove obsolete $wgRelatedArticlesLoggingBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127900 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [16:22:42] (03PS2) 10Jdlrobson: Remove obsolete $wgPopupsEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [16:23:20] (03PS2) 10Jdlrobson: Remove obsolete $wgMinervaApplyKnownTemplateHacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [16:23:26] (03PS2) 10Jdlrobson: Remove obsolete $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [16:25:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10636828 (10phaultfinder) [16:27:40] (03PS1) 
10Jdlrobson: Remove unnecessary boolean statement for $wmgIncreaseDefaultVectorFontSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127929 (https://phabricator.wikimedia.org/T388905) [16:27:42] (03PS1) 10Jdlrobson: Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) [16:31:29] (03PS2) 10Jdlrobson: Remove unnecessary boolean statement for $wmgIncreaseDefaultVectorFontSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127929 (https://phabricator.wikimedia.org/T388905) [16:31:37] (03PS2) 10Jdlrobson: Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) [16:38:52] (03CR) 10Dzahn: [C:03+2] gerrit: group similare prefixes under gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1127875 (owner: 10Hashar) [16:41:15] (03PS1) 10Vgutierrez: exim: Use RSA+ECDSA certificates for lists [puppet] - 10https://gerrit.wikimedia.org/r/1127933 (https://phabricator.wikimedia.org/T385067) [16:42:38] (03CR) 10Dzahn: [C:03+2] gerrit: move ByteDance blocks from Apache to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1127876 (https://phabricator.wikimedia.org/T375996) (owner: 10Hashar) [16:44:31] (03PS10) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [16:46:20] (03PS4) 10Dzahn: nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [16:47:30] (03PS3) 10Scott French: deployment_server: Support PHP version selection in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) [16:47:42] (03CR) 10JHathaway: [C:03+1] exim: Use RSA+ECDSA certificates for lists [puppet] - 
10https://gerrit.wikimedia.org/r/1127933 (https://phabricator.wikimedia.org/T385067) (owner: 10Vgutierrez) [16:48:24] (03CR) 10Dzahn: "I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127876 and since that adds an IPv6 address for the first time it showed the" [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [16:50:48] (03CR) 10Scott French: "Thanks for the review! Just tested this out (works as expected) and I'll plan to merge on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [16:50:58] (03PS1) 10Dzahn: Revert "gerrit: move ByteDance blocks from Apache to firewall" [puppet] - 10https://gerrit.wikimedia.org/r/1127936 [16:50:58] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [16:51:28] (03CR) 10Dzahn: [C:03+2] "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127527 but manually adding it did not fix it for me and I have a meeting no" [puppet] - 10https://gerrit.wikimedia.org/r/1127936 (owner: 10Dzahn) [16:51:43] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [16:51:44] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[16:52:24] (03CR) 10Dzahn: [V:03+2 C:03+2] Revert "gerrit: move ByteDance blocks from Apache to firewall" [puppet] - 10https://gerrit.wikimedia.org/r/1127936 (owner: 10Dzahn) [16:58:21] (03PS1) 10Vgutierrez: sre.loadbalancer.admin: Use same reason for disabling/enabling puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1127938 (https://phabricator.wikimedia.org/T388369) [16:59:50] (03PS1) 10Aaron Schulz: Revert "Temporary revert changeprop/changeprop-jobqueue to node 18 images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127939 [17:00:19] (03PS11) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [17:00:55] (03CR) 10Ssingh: [C:03+1] sre.loadbalancer.admin: Use same reason for disabling/enabling puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1127938 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [17:05:35] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer.admin: Use same reason for disabling/enabling puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1127938 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [17:06:33] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10637026 (10fnegri) a:05fnegri→03Andrew I did some more research on how to depool this host, and found there's an additional tip here: https://... 
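Note on the 16:06-16:07 exchange above: the quoted f-string puts the action (pool or depool) into the Puppet disable message, so the message built on the pool run can never equal the one written on the depool run. A minimal sketch of that mismatch follows. FakePuppetHost, buggy_reason, fixed_reason and round_trip are hypothetical names made up for illustration; they are not the cookbook's or Spicerack's real API, and the "fixed" variant is only an assumption about what change 1127938 might do.

```python
class FakePuppetHost:
    """Hypothetical stand-in for a host whose Puppet can be disabled with a message."""

    def __init__(self):
        self.disabled_message = None

    def disable(self, message):
        # Keep the first message; an already-disabled host keeps its original one.
        if self.disabled_message is None:
            self.disabled_message = message

    def enable(self, message):
        # Assumption: enabling succeeds only if the message matches the disable message.
        if self.disabled_message is None:
            return True
        if self.disabled_message != message:
            return False  # corresponds to "Mismatched message, not enabling puppet."
        self.disabled_message = None
        return True


def buggy_reason(action, hosts, reason):
    # Same shape as the line quoted in the channel: the action is part of the message.
    return f"{action} {hosts}: {reason}"


def fixed_reason(hosts, reason):
    # Assumed fix: leave the action out so both runs build the same string.
    return f"{hosts}: {reason}"


def round_trip(depool_message, pool_message):
    """Disable Puppet with the depool message, then try to enable with the pool message."""
    host = FakePuppetHost()
    host.disable(depool_message)
    return host.enable(pool_message)


hosts = "lvs1013.eqiad.wmnet"
buggy_ok = round_trip(
    buggy_reason("depool", hosts, "testing depool"),
    buggy_reason("pool", hosts, "testing depool"),
)
fixed_ok = round_trip(
    fixed_reason(hosts, "testing depool"),
    fixed_reason(hosts, "testing depool"),
)
print(buggy_ok, fixed_ok)
```

With the action in the message the round trip fails even when the operator passes the same reason both times; dropping the action (or storing the disable message and reusing it) lets the pool run match.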
[17:11:55] (03Merged) 10jenkins-bot: sre.loadbalancer.admin: Use same reason for disabling/enabling puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1127938 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [17:12:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637037 (10phaultfinder) [17:19:02] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10637052 (10bd808) >>! In T383723#10637026, @fnegri wrote: > Finally, I don't have a clear picture of which clients are connecting to clouddumps ho... [17:39:38] (03PS1) 10DLynch: Enable VisualEditor EditCheck multi-check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127945 (https://phabricator.wikimedia.org/T384372) [17:47:34] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10637137 (10BTullis) clouddumps1001 is also mounted from an-launcher1002, and this path is used for running a few DAGs. I also found this referenc... 
[17:50:33] (03PS1) 10Kamila Součková: services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) [17:50:34] (03PS1) 10Kamila Součková: ml-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127948 (https://phabricator.wikimedia.org/T388390) [17:50:36] (03PS1) 10Kamila Součková: dse-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127949 (https://phabricator.wikimedia.org/T388390) [17:50:38] (03PS1) 10Kamila Součková: aux-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) [17:55:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637200 (10phaultfinder) [18:02:08] (03CR) 10CI reject: [V:04-1] services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [18:03:21] (03PS1) 10Gergő Tisza: Try both SUL2 and SUL3 central domain for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) [18:03:36] (03CR) 10CI reject: [V:04-1] dse-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127949 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [18:04:10] (03CR) 10CI reject: [V:04-1] aux-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [18:04:18] (03CR) 10CI reject: [V:04-1] ml-services/*: use the correct helm version in each cluster 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127948 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [18:05:30] (03PS1) 10Dzahn: gerrit: move ByteDance (and only this) block from Apache to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1127953 (https://phabricator.wikimedia.org/T375996) [18:06:12] (03PS2) 10Dzahn: gerrit: move ByteDance (and only this) block from Apache to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1127953 (https://phabricator.wikimedia.org/T375996) [18:14:33] (03PS1) 10Gergő Tisza: Enable SUL3 logins on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) [18:14:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [18:15:52] (03CR) 10Dzahn: [C:03+2] gerrit: move ByteDance (and only this) block from Apache to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1127953 (https://phabricator.wikimedia.org/T375996) (owner: 10Dzahn) [18:32:32] (03CR) 10Dzahn: [C:03+2] "I did https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127953 so ByteDance is moved just that CacheFly IPv6 address is not." 
[puppet] - 10https://gerrit.wikimedia.org/r/1127876 (https://phabricator.wikimedia.org/T375996) (owner: 10Hashar) [19:00:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637329 (10phaultfinder) [19:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:58] (03PS1) 10Ladsgroup: [WIP] MetaContactPages: Add affcom conflict reporting page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127958 (https://phabricator.wikimedia.org/T388919) [19:15:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637383 (10phaultfinder) [19:18:22] (03PS1) 10Tchanders: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) [19:31:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637422 (10phaultfinder) [20:22:50] (03PS12) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [20:23:01] (03PS1) 10Gergő Tisza: Enable credentials change special pages on SUL3 shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127965 (https://phabricator.wikimedia.org/T362715) [20:23:14] (03CR) 10CI reject: [V:04-1] grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [20:23:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [20:24:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127965 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza) [20:24:35] (03PS13) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [20:26:21] (03CR) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [20:26:23] (03CR) 10Jdlrobson: [C:03+1] Remove obsolete $wgRelatedArticlesLoggingBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127900 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [20:27:31] (03CR) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [20:27:38] (03CR) 10Jdlrobson: [C:03+1] Remove obsolete $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [20:27:42] (03CR) 10Jdlrobson: [C:03+1] Remove obsolete $wgMinervaApplyKnownTemplateHacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [20:27:45] (03CR) 10Jdlrobson: [C:03+1] Remove obsolete $wgPopupsEventLogging [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [20:31:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637555 (10phaultfinder) [20:45:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637664 (10phaultfinder) [20:47:44] (03PS1) 10Bvibber: Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) [20:48:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [20:49:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber) [21:02:57] (03CR) 10Cwhite: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [21:06:44] !log zabe@mwmaint2002:~$ mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php testwiki --dump /home/zabe/afl_text_table_dump/testwiki --deletedump /home/zabe/afl_text_table_deletedump/testwiki --sleep 0.3 # T381599 [21:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:48] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [21:15:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637811 (10phaultfinder) [21:25:56] !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php testwiki --delete /home/zabe/afl_text_table_deletedump/testwiki --sleep 0.3 # T381599 [21:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:00] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [21:59:36] (03CR) 10Ecarg: [C:03+2] Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127057 (https://phabricator.wikimedia.org/T385859) (owner: 10Jforrester) [22:01:35] (03Merged) 10jenkins-bot: Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127057 (https://phabricator.wikimedia.org/T385859) (owner: 10Jforrester) [22:09:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10637970 (10phaultfinder) [22:44:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638107 (10phaultfinder) [23:02:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access 
to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10638118 (10BCornwall) @ATsay-WMF Do you approve of this? Thanks! [23:04:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638119 (10phaultfinder) [23:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10638127 (10phaultfinder) [23:34:55] (03CR) 10Catrope: [C:03+1] Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber)