[00:03:23] (03CR) 10Dzahn: [V:03+1 C:03+2] "like this the diff is just some inconsistencies about "status_matches" but the default value should be 200." [puppet] - 10https://gerrit.wikimedia.org/r/1161509 (owner: 10Filippo Giunchedi) [00:04:30] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:04:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:06:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:06:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:08:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1166943 [00:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1166943 (owner: 10TrainBranchBot) [00:12:53] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop on people* and alert*" [puppet] - 10https://gerrit.wikimedia.org/r/1161509 (owner: 10Filippo Giunchedi) [00:15:52] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Hiera#Puppet_enc_system" [puppet] - 10https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: 10BryanDavis) [00:16:56] (03CR) 10Dzahn: [C:03+2] "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Hiera#Puppet_enc_system" [puppet] - 10https://gerrit.wikimedia.org/r/1166262 
(https://phabricator.wikimedia.org/T397591) (owner: 10BryanDavis) [00:19:25] (03CR) 10Dzahn: [C:03+2] gitlab: Allow WMCS runners to talk to deployment-prep wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166262 (https://phabricator.wikimedia.org/T397591) (owner: 10BryanDavis) [00:20:04] (03PS3) 10BryanDavis: gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) [00:20:39] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [00:21:03] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [00:21:15] (03CR) 10Dzahn: [C:03+2] gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: 10BryanDavis) [00:33:48] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1166943 (owner: 10TrainBranchBot) [01:08:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.9 [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1166950 (https://phabricator.wikimedia.org/T392179) [01:08:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.9 [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1166950 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [01:19:27] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.9 [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1166950 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [01:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:57:48] RESOLVED: PuppetZeroResources: Puppet has failed generate 
resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:59:35] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0200) [02:20:19] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [02:21:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:36] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [02:42:12] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0300) [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0400) [04:04:28] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.6 (duration: 04m 24s) [04:06:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:06:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:14:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10981837 (10Marostegui) @VRiley-WMF from our side the host is fine. If you or @Jclark-ctr need to work on upgrade firmwares and BIOS, please let me know so I can depool it and have it ready for it. [04:17:41] (03PS1) 10Marostegui: db1237: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1166959 (https://phabricator.wikimedia.org/T397279) [04:18:15] (03CR) 10Marostegui: [C:03+2] db1237: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1166959 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui) [04:23:47] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1222 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1166960 (https://phabricator.wikimedia.org/T398906) [04:23:51] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1166961 (https://phabricator.wikimedia.org/T398906) [04:26:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T398906 [04:26:23] T398906: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T398906 [04:26:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1222 with weight 0 T398906', diff saved to https://phabricator.wikimedia.org/P78780 and previous config saved to /var/cache/conftool/dbconfig/20250708-042646-root.json [04:31:20] (03CR) 10Marostegui: [C:03+2] mariadb: 
Promote db1222 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1166960 (https://phabricator.wikimedia.org/T398906) (owner: 10Gerrit maintenance bot) [04:33:59] (03PS3) 10KartikMistry: machinetranslation: Use s3 for model download in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) [04:36:15] !log Starting s2 eqiad failover from db1162 to db1222 - T398906 [04:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:18] T398906: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T398906 [04:36:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T398906', diff saved to https://phabricator.wikimedia.org/P78781 and previous config saved to /var/cache/conftool/dbconfig/20250708-043628-root.json [04:36:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1222 to s2 primary and set section read-write T398906', diff saved to https://phabricator.wikimedia.org/P78782 and previous config saved to /var/cache/conftool/dbconfig/20250708-043654-root.json [04:37:20] !log marostegui@dns1006 START - running authdns-update [04:37:30] (03CR) 10Marostegui: [C:03+2] wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1166961 (https://phabricator.wikimedia.org/T398906) (owner: 10Gerrit maintenance bot) [04:38:04] !log marostegui@dns1006 END - running authdns-update [04:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1162 T398906', diff saved to https://phabricator.wikimedia.org/P78783 and previous config saved to /var/cache/conftool/dbconfig/20250708-043814-marostegui.json [04:38:36] !log marostegui@dns1006 START - running authdns-update [04:39:23] !log marostegui@dns1006 END - running authdns-update [04:40:31] (03PS1) 10Marostegui: db1162: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1166963 (https://phabricator.wikimedia.org/T396549) 
[04:41:00] (03CR) 10Marostegui: [C:03+2] db1162: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1166963 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [04:47:20] PROBLEM - Host an-worker1095 is DOWN: PING CRITICAL - Packet loss = 100% [04:47:48] (03PS1) 10Marostegui: db1237: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1166964 [04:48:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P78784 and previous config saved to /var/cache/conftool/dbconfig/20250708-044803-root.json [04:51:54] (03CR) 10Marostegui: [C:03+2] db1237: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1166964 (owner: 10Marostegui) [04:58:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78785 and previous config saved to /var/cache/conftool/dbconfig/20250708-045812-root.json [05:03:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78786 and previous config saved to /var/cache/conftool/dbconfig/20250708-050308-root.json [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78787 and previous config saved to /var/cache/conftool/dbconfig/20250708-051318-root.json [05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78788 and previous config saved to /var/cache/conftool/dbconfig/20250708-051814-root.json [05:28:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78789 and previous config saved to /var/cache/conftool/dbconfig/20250708-052823-root.json [05:33:10] (03PS1) 10Giuseppe Lavagetto: Stop loggging requests that would not be rate-limited [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166967 [05:33:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78790 and previous config saved to /var/cache/conftool/dbconfig/20250708-053320-root.json [05:33:28] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Stop loggging requests that would not be rate-limited [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166967 (owner: 10Giuseppe Lavagetto) [05:33:40] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2003.wikimedia.org with reason: WIP [05:35:14] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Feature: better logging of varnish rate-limits - oblivian@cumin1003" [05:35:15] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: better logging of varnish rate-limits - oblivian@cumin1003 [05:35:47] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: better logging of varnish rate-limits - oblivian@cumin1003 [05:35:48] !log oblivian@cumin1003 END (PASS) - Cookbook 
sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Feature: better logging of varnish rate-limits - oblivian@cumin1003" [05:41:35] (03PS1) 10Giuseppe Lavagetto: Revert "Stop loggging requests that would not be rate-limited" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166968 [05:41:42] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Revert "Stop loggging requests that would not be rate-limited" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166968 (owner: 10Giuseppe Lavagetto) [05:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:41:58] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003" [05:41:59] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - oblivian@cumin1003 [05:42:28] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - oblivian@cumin1003 [05:42:29] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003" [05:42:36] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003" [05:42:37] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - oblivian@cumin1003 [05:43:04] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - 
oblivian@cumin1003 [05:43:06] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003" [05:43:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78791 and previous config saved to /var/cache/conftool/dbconfig/20250708-054329-root.json [05:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78792 and previous config saved to /var/cache/conftool/dbconfig/20250708-054825-root.json [05:50:49] (03PS1) 10Marostegui: s3 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1166969 (https://phabricator.wikimedia.org/T383795) [05:51:23] (03CR) 10Marostegui: "This is a NOOP until the change is made lively on the hosts (or mariadb is restarted)" [puppet] - 10https://gerrit.wikimedia.org/r/1166969 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [05:51:27] (03CR) 10Marostegui: [C:03+2] s3 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1166969 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [05:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:52:43] !log Migrate s3 codfw to SBR T383795 [05:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:46] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [05:53:08] (03CR) 10Arnaudb: [C:03+1] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0600) [06:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0600). [06:13:21] (03PS1) 10Giuseppe Lavagetto: Fix varnish logging of rate-limiting, take 2 [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166970 [06:13:34] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Fix varnish logging of rate-limiting, take 2 [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1166970 (owner: 10Giuseppe Lavagetto) [06:14:27] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix varnis logging (take 2) - oblivian@cumin1003" [06:14:28] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix varnis logging (take 2) - oblivian@cumin1003 [06:14:58] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix varnis logging (take 2) - oblivian@cumin1003 [06:15:00] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix varnis logging (take 2) - oblivian@cumin1003" [06:16:20] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:19:25] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Exim [06:21:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:00] (03PS1) 10Giuseppe Lavagetto: Revert "Fix varnish logging of rate-limiting, take 2" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167078 [06:30:29] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Revert "Fix varnish logging of rate-limiting, take 2" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167078 (owner: 10Giuseppe Lavagetto) [06:30:50] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Revert - oblivian@cumin1003" [06:30:51] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Revert - oblivian@cumin1003 [06:31:22] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Revert - oblivian@cumin1003 [06:31:23] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Revert - oblivian@cumin1003" [06:35:47] !log rebalance following reimages T382513 [06:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:49] T382513: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513 [06:36:38] (03PS1) 10Giuseppe Lavagetto: Revert logging changes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167079 [06:38:26] (03CR) 10Elukey: pyrra: remove multi-dc for istio-based SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [06:38:39] (03CR) 
10Vgutierrez: [C:03+2] hiera: Remove esams and magru bgp peer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1166870 (owner: 10Vgutierrez) [06:42:09] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167080 [06:50:26] (03CR) 10Filippo Giunchedi: "How did you pick 5m ? The current puppet runs on alert hosts take ~3m so 5m would mean puppet-agent basically running all the time, is tha" [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: 10Herron) [06:56:15] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167080 (owner: 10PipelineBot) [06:58:09] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167080 (owner: 10PipelineBot) [07:00:04] Amir1, Urbanecm, and awight: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0700). [07:00:04] Tchanders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:52] o/ [07:00:57] I'll deploy my own patch [07:01:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:01:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166791 (https://phabricator.wikimedia.org/T381845) (owner: 10Tchanders) [07:02:06] (03PS1) 10Volans: tox.ini: skip Python 3.10 in CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167081 [07:02:25] (03PS2) 10Volans: cookbook API: simplify -t/--task-id support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1154787 [07:02:25] (03CR) 10Volans: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1154787 (owner: 10Volans) [07:03:06] (03CR) 10Nikerabbit: [C:03+1] CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [07:03:18] (03Merged) 10jenkins-bot: temp accounts: Separate digits in user names with hyphens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166791 (https://phabricator.wikimedia.org/T381845) (owner: 10Tchanders) [07:03:42] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1166791|temp accounts: Separate digits in user names with hyphens (T381845)]] [07:03:44] T381845: Add hyphens to break temporary user names into groups of <5 digits - https://phabricator.wikimedia.org/T381845 [07:05:48] !log tchanders@deploy1003 tchanders: Backport for [[gerrit:1166791|temp accounts: Separate digits in user names with hyphens (T381845)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [07:06:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:09:18] !log tchanders@deploy1003 tchanders: Continuing with sync [07:14:44] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166791|temp accounts: Separate digits in user names with hyphens (T381845)]] (duration: 11m 02s) [07:14:48] T381845: Add hyphens to break temporary user names into groups of <5 digits - https://phabricator.wikimedia.org/T381845 [07:17:13] My patch is done, but I won't log that the window is done, in case anyone else wants to deploy something in the next 40 minutes [07:19:28] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [07:22:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [07:26:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:30:32] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [07:30:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti 
servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10982110 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done. [07:32:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:36:41] (03CR) 10Gmodena: [C:03+2] services: mw-page-content-change-enrich: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [07:38:17] (03Merged) 10jenkins-bot: services: mw-page-content-change-enrich: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [07:42:06] (03CR) 10Fabfur: [C:03+2] cache: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [07:42:14] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [07:42:36] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [07:45:12] !log temporary disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/1135643 (T329332) [07:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:52:00] (03PS1) 10Marostegui: s3 eqiad: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1167142 (https://phabricator.wikimedia.org/T383795) [07:52:19] (03PS1) 10Vgutierrez: hiera: Issue dedicated certs for probenet endpoints [puppet] - 
10https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) [07:53:03] (03CR) 10Marostegui: "This is a NOOP until the change is made lively on the databases or we restart mariadb" [puppet] - 10https://gerrit.wikimedia.org/r/1167142 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [07:53:07] (03CR) 10Marostegui: [C:03+2] s3 eqiad: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1167142 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [07:53:53] (03PS2) 10Vgutierrez: hiera: Issue dedicated certs for probenet endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) [07:54:21] !log Migrate s3 eqiad to SBR T383795 [07:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:24] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [07:55:27] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [07:55:28] (03CR) 10Klausman: [C:03+1] machinetranslation: Use s3 for model download in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) (owner: 10KartikMistry) [07:55:48] !log enabling puppet on A:cp (T329332) [07:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:45] (03CR) 10Jgiannelos: [C:04-1] "Overall other than the kafka topic, it looks OK." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:00:05] andre and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0800). 
[08:00:36] (03CR) 10Jgiannelos: [C:04-1] services: configure tegola in codfw to use maps-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:00:54] 06SRE, 06Traffic: Benthos - remove the kafka output module - https://phabricator.wikimedia.org/T398916 (10Fabfur) 03NEW [08:01:58] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:02:14] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:06:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:06:47] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [08:06:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:06:57] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:10:38] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3585 MB (3% inode=98%): /tmp 3585 MB (3% inode=98%): /var/tmp 3585 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [08:11:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [08:11:34] !log gmodena@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mw-page-content-change-enrich: apply [08:11:42] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:11:45] !log installing postgresql-15 security updates [08:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:48] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167144 (https://phabricator.wikimedia.org/T392179) [08:14:50] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167144 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [08:15:51] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167144 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [08:16:17] !log aklapper@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.9 refs T392179 [08:16:21] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [08:17:41] (03PS9) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [08:21:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:51] (03CR) 10Gmodena: [C:03+2] dse: mw-content-history: version bump image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1166921 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [08:22:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:23:29] (03Merged) 10jenkins-bot: dse: mw-content-history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166921 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [08:26:00] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [08:26:11] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:26:18] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:28:30] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:28:54] (03PS1) 10Tiziano Fogli: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167145 [08:30:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:30:23] !log created a stub user "bumpuid" to move the allocation of UIDs for accounted created in Wikimedia IDM to 100000+ T355663 [08:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:26] T355663: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663 [08:30:35] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:30:44] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:35:04] (03PS9) 10Btullis: Add the new cephosd200[1-3] servers in codfw to their role [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) [08:36:50] (03PS3) 10Ladsgroup: tables-catalog: Mark vision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1166854 (https://phabricator.wikimedia.org/T363581) [08:36:56] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Mark vision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1166854 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [08:38:12] (03CR) 10Vgutierrez: "looks good,added some inline comments about discrepancies between regex in requestcl and here and a suggestion about how to improve one of" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [08:39:10] (03PS1) 10Ladsgroup: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167148 (https://phabricator.wikimedia.org/T398033) [08:39:22] (03PS1) 10Ladsgroup: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) [08:39:34] (03Abandoned) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [08:39:42] jouncebot: nowandnext [08:39:42] For the next 1 hour(s) and 20 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0800) [08:39:42] In 1 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1000) [08:40:01] (03PS1) 10Hashar: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167150 [08:40:27] 06SRE, 06Infrastructure-Foundations: 
Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10982431 (10MoritzMuehlenhoff) [08:40:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10982433 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [08:42:20] (03PS2) 10Hashar: Remove specific force push to refs/sandbox/* branches [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167150 (https://phabricator.wikimedia.org/T398921) [08:43:51] (03Abandoned) 10Tiziano Fogli: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167145 (owner: 10Tiziano Fogli) [08:44:07] (03CR) 10Hashar: [V:03+2 C:03+2] Remove specific force push to refs/sandbox/* branches [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167150 (https://phabricator.wikimedia.org/T398921) (owner: 10Hashar) [08:45:17] (03CR) 10Brouberol: Add the new cephosd200[1-3] servers in codfw to their role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [08:46:14] (03Abandoned) 10Majavah: Ensure service catalog schema matches spicerack release [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [08:48:06] (03CR) 10CI reject: [V:04-1] Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [08:48:09] !log installing Redis security updates [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:01] (03PS3) 10Majavah: Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 [08:50:27] (03CR) 10Majavah: "found this while cleaning up my puppet.git clone.. this still looks relevant?" 
[puppet] - 10https://gerrit.wikimedia.org/r/928582 (owner: 10Majavah) [08:52:50] !log installing nginx security updates [08:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:02] (03Abandoned) 10Majavah: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [08:54:15] (03Abandoned) 10Majavah: P:toolforge::grid: add bash completion to exec-manage [puppet] - 10https://gerrit.wikimedia.org/r/815780 (owner: 10Majavah) [08:55:13] (03Abandoned) 10Majavah: aptrepo: cleanup haproxy update and component names [puppet] - 10https://gerrit.wikimedia.org/r/969819 (owner: 10Majavah) [08:56:12] (03PS1) 10Vgutierrez: varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) [08:58:05] (03Abandoned) 10Majavah: kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [08:59:15] (03CR) 10Btullis: [V:03+1] Add the new cephosd200[1-3] servers in codfw to their role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [08:59:35] !log aklapper@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.9 refs T392179 (duration: 43m 18s) [08:59:38] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [09:02:46] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167152 (https://phabricator.wikimedia.org/T392179) [09:02:47] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167152 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [09:03:47] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167152 
(https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [09:04:38] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-eqiad [09:08:13] (03PS2) 10Vgutierrez: varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) [09:12:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-eqiad [09:15:10] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:25] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.9 refs T392179 [09:15:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:30] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [09:17:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:17:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:18:53] (03PS3) 10Vgutierrez: varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) [09:18:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10982612 (10cmooney) 05Open→03Resolved >>! In T398598#10980766, @Jhancock.wm wrote: > reset the tripped breaker in D3. On... 
[09:19:11] I had a quick look at lists1004, nothing out of the ordinary [09:19:26] If this croaks again in the day I'll have a more serious look [09:19:54] (03CR) 10Vgutierrez: "varnishtests are happy: `0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) (owner: 10Vgutierrez) [09:21:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2329.mgmt:22 - https://phabricator.wikimedia.org/T398559#10982616 (10cmooney) 05Open→03Resolved a:03cmooney [09:21:55] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for bast2003.mgmt:22 - https://phabricator.wikimedia.org/T398557#10982619 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:02] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2219.mgmt:22 - https://phabricator.wikimedia.org/T398556#10982622 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:19] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2181.mgmt:22 - https://phabricator.wikimedia.org/T398573#10982625 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:27] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aux-k8s-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T398572#10982628 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:33] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2213.mgmt:22 - https://phabricator.wikimedia.org/T398571#10982631 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:41] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2040.mgmt:22 - https://phabricator.wikimedia.org/T398570#10982634 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:48] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2044.mgmt:22 - https://phabricator.wikimedia.org/T398569#10982637 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:54] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2182.mgmt:22 - https://phabricator.wikimedia.org/T398568#10982640 (10cmooney) 
05Open→03Resolved a:03cmooney [09:23:02] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for puppetdb2003.mgmt:22 - https://phabricator.wikimedia.org/T398567#10982643 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:15] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2173.mgmt:22 - https://phabricator.wikimedia.org/T398565#10982646 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:23] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2217.mgmt:22 - https://phabricator.wikimedia.org/T398564#10982649 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:29] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2330.mgmt:22 - https://phabricator.wikimedia.org/T398563#10982652 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:37] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2320.mgmt:22 - https://phabricator.wikimedia.org/T398562#10982655 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:44] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2201.mgmt:22 - https://phabricator.wikimedia.org/T398561#10982658 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:52] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for pc2016.mgmt:22 - https://phabricator.wikimedia.org/T398560#10982661 (10cmooney) 05Open→03Resolved a:03cmooney [09:30:38] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3437 MB (3% inode=98%): /tmp 3437 MB (3% inode=98%): /var/tmp 3437 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [09:38:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167158 [09:39:17] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167158 (owner: 10PipelineBot) [09:40:40] 06SRE, 
06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10982781 (10MoritzMuehlenhoff) [09:40:50] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167158 (owner: 10PipelineBot) [09:41:08] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:41:34] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:44:41] (03PS1) 10Ladsgroup: tables-catalog: Temporarily set categorylinks to partially public [puppet] - 10https://gerrit.wikimedia.org/r/1167159 (https://phabricator.wikimedia.org/T299951) [09:45:39] (03PS1) 10Jgiannelos: pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 [09:46:34] (03CR) 10Clément Goubert: [C:03+1] api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [09:46:46] (03PS2) 10Jgiannelos: pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 [09:47:21] (03CR) 10Hnowlan: [C:03+1] pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 (owner: 10Jgiannelos) [09:47:31] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Temporarily set categorylinks to partially public [puppet] - 10https://gerrit.wikimedia.org/r/1167159 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [09:48:05] (03CR) 10Jgiannelos: [C:03+2] pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 (owner: 10Jgiannelos) [09:50:02] (03Merged) 10jenkins-bot: pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 (owner: 10Jgiannelos) [09:50:59] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:51:06] !log 
jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:51:22] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:51:31] !log installing openssl security updates on Bullseye [09:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:40] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:52:17] (03PS4) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [09:53:10] !log dropping term store tables on s8 sanitarium master (T351820) [09:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:13] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [09:55:11] (03PS1) 10Zabe: Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 [09:55:57] (03CR) 10CI reject: [V:04-1] Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 (owner: 10Zabe) [09:56:18] (03PS2) 10Zabe: Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 [09:57:59] (03PS1) 10Zabe: Set categorylinks to read new in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167164 (https://phabricator.wikimedia.org/T397912) [09:58:05] (03PS1) 10Majavah: hieradata: Bump Striker to 2025-07-08-094946-production [puppet] - 10https://gerrit.wikimedia.org/r/1167165 (https://phabricator.wikimedia.org/T355663) [09:59:46] (03CR) 10Majavah: [C:03+2] hieradata: Bump Striker to 2025-07-08-094946-production [puppet] - 10https://gerrit.wikimedia.org/r/1167165 (https://phabricator.wikimedia.org/T355663) (owner: 10Majavah) [10:00:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1000) [10:01:55] (03PS1) 10Marostegui: db2157: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167168 (https://phabricator.wikimedia.org/T398928) [10:02:12] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:03:34] (03CR) 10Marostegui: [C:03+2] db2157: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167168 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [10:03:46] it'll recover soon [10:04:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2157', diff saved to https://phabricator.wikimedia.org/P78795 and previous config saved to /var/cache/conftool/dbconfig/20250708-100434-marostegui.json [10:05:16] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10982848 (10BTullis) [10:05:48] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10982849 (10BTullis) 05Open→03Resolved a:03BTullis [10:06:14] (03PS1) 10Jcrespo: mariadb: Upgrade db1216 & db2201 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167173 (https://phabricator.wikimedia.org/T398928) [10:07:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10982862 (10MoritzMuehlenhoff) [10:07:29] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [10:09:15] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [10:11:40] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [10:12:33] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:13:12] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [10:14:36] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:14:37] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [10:16:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [10:20:33] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:21:11] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:21:12] (03CR) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [10:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1159 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78796 and previous config saved to /var/cache/conftool/dbconfig/20250708-102114-marostegui.json [10:21:35] (03PS10) 10Fabfur: varnish: replace X-Public-Cloud with new 
X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [10:21:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78797 and previous config saved to /var/cache/conftool/dbconfig/20250708-102140-root.json [10:21:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1004.eqiad.wmnet [10:25:17] (03PS1) 10Ladsgroup: api-testing: Loosen the assert on max-age header [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167176 [10:25:49] (03PS2) 10Ladsgroup: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) [10:26:07] jouncebot: nowandnext [10:26:07] For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1000) [10:26:08] In 1 hour(s) and 33 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1200) [10:26:29] (03PS1) 10Clément Goubert: Revert "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167177 [10:26:44] (03CR) 10Ladsgroup: [C:03+2] Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [10:26:48] (03CR) 10Ladsgroup: [C:03+2] api-testing: Loosen the assert on max-age header [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167176 (owner: 10Ladsgroup) [10:26:53] (03CR) 10Ladsgroup: [C:03+2] Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167148 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [10:27:07] !log btullis@cumin1003 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1004.eqiad.wmnet [10:27:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78798 and previous config saved to /var/cache/conftool/dbconfig/20250708-102746-root.json [10:29:30] (03PS1) 10Marostegui: db1159: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167180 (https://phabricator.wikimedia.org/T398928) [10:30:01] (03CR) 10Marostegui: [C:03+2] db1159: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167180 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [10:30:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [10:30:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167176 (owner: 10Ladsgroup) [10:30:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/FlaggedRevs] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167148 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [10:31:03] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:31:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1159 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78799 and previous config saved to /var/cache/conftool/dbconfig/20250708-103106-marostegui.json [10:32:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:27] (03CR) 10Vgutierrez: [C:03+1] "please note that this will break at least`cache-text/public_cloud_deprecated_api`" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [10:34:05] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [10:34:31] (03CR) 10Vgutierrez: [C:03+1] varnish: replace X-Public-Cloud with new X-Provenance header check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [10:36:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78800 and previous config saved to /var/cache/conftool/dbconfig/20250708-103645-root.json [10:37:15] !log reboot apus frontends in eqiad T395240 [10:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:37:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:38:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78801 and previous config saved to /var/cache/conftool/dbconfig/20250708-103826-root.json [10:42:16] (03Merged) 10jenkins-bot: api-testing: Loosen the assert on max-age header [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167176 (owner: 
10Ladsgroup) [10:42:18] (03Merged) 10jenkins-bot: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [10:42:21] (03Merged) 10jenkins-bot: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167148 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [10:42:56] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1167149|Fully get rid of tracking and updating pages (T398033)]], [[gerrit:1167176|api-testing: Loosen the assert on max-age header]], [[gerrit:1167148|Fully get rid of tracking and updating pages (T398033)]] [10:42:59] T398033: Traffic spike on s7 due to heavy update query - https://phabricator.wikimedia.org/T398033 [10:43:47] (03PS1) 10Marostegui: db1175: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1167184 [10:43:58] (03PS1) 10Jcrespo: dbbackups: Upgrade dbprov1005 & dbprov2005 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167185 (https://phabricator.wikimedia.org/T394487) [10:44:22] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1005.eqiad.wmnet [10:44:29] (03CR) 10Marostegui: [C:03+2] db1175: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1167184 (owner: 10Marostegui) [10:45:07] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1167149|Fully get rid of tracking and updating pages (T398033)]], [[gerrit:1167176|api-testing: Loosen the assert on max-age header]], [[gerrit:1167148|Fully get rid of tracking and updating pages (T398033)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:47:00] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:49:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1005.eqiad.wmnet [10:51:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78802 and previous config saved to /var/cache/conftool/dbconfig/20250708-105151-root.json [10:52:29] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167149|Fully get rid of tracking and updating pages (T398033)]], [[gerrit:1167176|api-testing: Loosen the assert on max-age header]], [[gerrit:1167148|Fully get rid of tracking and updating pages (T398033)]] (duration: 09m 33s) [10:52:32] T398033: Traffic spike on s7 due to heavy update query - https://phabricator.wikimedia.org/T398033 [10:53:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78803 and previous config saved to /var/cache/conftool/dbconfig/20250708-105332-root.json [10:53:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:54:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [10:54:06] !log reboot apus frontends in codfw T395240 [10:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:56:25] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [10:56:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host matomo1003.eqiad.wmnet [10:58:09] (03CR) 10Fabfur: "ack, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [10:58:20] (03PS11) 10Fabfur: varnish: replace X-Public-Cloud with new 
X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [11:00:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1003.eqiad.wmnet [11:00:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [11:02:50] 4 [11:03:40] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2201.codfw.wmnet,db1216.eqiad.wmnet with reason: MariaDB package update [11:04:03] !log jmm@cumin1002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:cloudelastic [11:06:20] !log jmm@cumin1002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:cloudelastic [11:06:30] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [11:06:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78805 and previous config saved to /var/cache/conftool/dbconfig/20250708-110656-root.json [11:07:08] !log jmm@cumin1002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-codfw [11:07:46] (03PS1) 10Majavah: openstack::patch: Disable fuzzing patch locations [puppet] - 10https://gerrit.wikimedia.org/r/1167189 [11:08:20] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix requestctl= sanitization [puppet] - 10https://gerrit.wikimedia.org/r/1166775 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [11:08:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78806 and previous config saved to /var/cache/conftool/dbconfig/20250708-110838-root.json [11:09:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster 
(exit_code=0) [11:09:54] (03CR) 10Majavah: [C:03+2] openstack::patch: Disable fuzzing patch locations [puppet] - 10https://gerrit.wikimedia.org/r/1167189 (owner: 10Majavah) [11:10:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [11:11:20] (03PS2) 10Jcrespo: mariadb: Upgrade db1216 & db2201 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167173 (https://phabricator.wikimedia.org/T398928) [11:13:25] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db1216 & db2201 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167173 (https://phabricator.wikimedia.org/T398928) (owner: 10Jcrespo) [11:15:16] !log restarting slapd on seaborgium/serpens to pick up OpenSSL updates [11:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940 (10cmooney) 03NEW p:05Triage→03Medium [11:19:25] jouncebot: nowandnext [11:19:25] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [11:19:25] In 0 hour(s) and 40 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1200) [11:20:20] (03CR) 10Zabe: [C:03+2] Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 (owner: 10Zabe) [11:20:21] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167164 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [11:20:37] !log upgrade db1216 mariadb package T394487 [11:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:39] T394487: Migrate backup sources to MariaDB 10.11 - https://phabricator.wikimedia.org/T394487 [11:21:10] 
(03Merged) 10jenkins-bot: Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 (owner: 10Zabe) [11:21:12] (03Merged) 10jenkins-bot: Set categorylinks to read new in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167164 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [11:22:03] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167162|Remove redundant group0 config for categorylinks]], [[gerrit:1167164|Set categorylinks to read new in cebwiki (T397912)]] [11:22:06] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [11:23:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78807 and previous config saved to /var/cache/conftool/dbconfig/20250708-112344-root.json [11:24:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet [11:24:08] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167162|Remove redundant group0 config for categorylinks]], [[gerrit:1167164|Set categorylinks to read new in cebwiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:25:59] !log zabe@deploy1003 zabe: Continuing with sync [11:27:01] (03PS1) 10Clément Goubert: check_user: Use deploy instead of mwmaint [puppet] - 10https://gerrit.wikimedia.org/r/1167195 (https://phabricator.wikimedia.org/T397017) [11:27:04] (03PS1) 10Clément Goubert: mwaint: Remove from scap [puppet] - 10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) [11:27:06] (03PS1) 10Clément Goubert: mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) [11:27:20] !log jmm@cumin1002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-codfw [11:27:24] !log restarting apache on mirror1001 to pick up openssl sec updates [11:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:44] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [11:29:50] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [11:31:39] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167162|Remove redundant group0 config for categorylinks]], [[gerrit:1167164|Set categorylinks to read new in cebwiki (T397912)]] (duration: 09m 35s) [11:31:42] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [11:33:49] (03CR) 10Muehlenhoff: "I think you can simply ignore this script; it's already broken and we'll most likely just remove it: https://phabricator.wikimedia.org/T39" [puppet] - 10https://gerrit.wikimedia.org/r/1167195 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:34:54] (03CR) 10Clément Goubert: "Ack." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167195 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:35:02] !log Restarted Apache on gerrit1003 and gerrit2002 [11:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:09] (03Abandoned) 10Clément Goubert: check_user: Use deploy instead of mwmaint [puppet] - 10https://gerrit.wikimedia.org/r/1167195 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:35:12] (03CR) 10Muehlenhoff: mwmaint: deprecate mwmaint servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:35:18] (03CR) 10Hnowlan: [C:03+1] mwaint: Remove from scap [puppet] - 10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:35:44] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1208.eqiad.wmnet [11:36:13] !log jmm@cumin1002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge [11:36:16] PROBLEM - MariaDB Replica IO: matomo on db1208 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:36:18] PROBLEM - mysqld processes on db1208 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:36:18] PROBLEM - MariaDB Replica Lag: matomo on db1208 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:36:18] PROBLEM - MariaDB Replica SQL: matomo on db1208 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:36:18] PROBLEM - MariaDB read only matomo on db1208 is CRITICAL: Could not connect to localhost:3351 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:36:26] (03PS2) 10Clément Goubert: mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) [11:36:26] btullis: ^ [11:36:36] (03CR) 10Clément Goubert: mwmaint: deprecate mwmaint servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:37:04] !log jmm@cumin1002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:relforge [11:37:43] (03CR) 10Clément Goubert: [C:03+2] Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 (owner: 10Majavah) [11:38:05] (03CR) 10Hnowlan: [C:03+1] mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:39:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:39:16] (03CR) 10Hnowlan: [C:03+1] Revert "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167177 (owner: 10Clément Goubert) [11:39:18] RECOVERY - mysqld processes on db1208 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:39:18] RECOVERY - MariaDB Replica SQL: matomo on db1208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:39:20] RECOVERY - MariaDB read only matomo on db1208 is OK: Version 10.6.18-MariaDB-log, Uptime 26s, read_only: True, event_scheduler: True, 11.22 QPS, connection latency: 0.033039s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:39:20] Apologies for the noise re db1208 - that was
me. [11:39:43] (03PS1) 10Vgutierrez: Revert "cache,haproxy: Remove http response captures" [puppet] - 10https://gerrit.wikimedia.org/r/1167200 [11:39:44] !log upgrade db2201 mariadb package T394487 [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:46] T394487: Migrate backup sources to MariaDB 10.11 - https://phabricator.wikimedia.org/T394487 [11:39:58] (03CR) 10CI reject: [V:04-1] Revert "cache,haproxy: Remove http response captures" [puppet] - 10https://gerrit.wikimedia.org/r/1167200 (owner: 10Vgutierrez) [11:40:16] RECOVERY - MariaDB Replica IO: matomo on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:40:18] RECOVERY - MariaDB Replica Lag: matomo on db1208 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:40:58] (03CR) 10Muehlenhoff: [C:04-1] "Actually, I forgot one thing: These are still on Puppet 5, and the insetup roles default to Puppet 7, so instead we'll need to set these t" [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:41:36] (03PS3) 10Clément Goubert: mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) [11:41:57] (03CR) 10Clément Goubert: "Done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:42:13] !log jmm@cumin1002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-eqiad [11:43:05] (03PS2) 10Vgutierrez: Revert "cache,haproxy: Remove http response captures" [puppet] - 10https://gerrit.wikimedia.org/r/1167200 (https://phabricator.wikimedia.org/T397917) [11:43:56] (03CR) 10Hnowlan: [C:03+2] api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [11:44:18] (03CR) 10KartikMistry: [C:03+2] machinetranslation: Use s3 for model download in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) (owner: 10KartikMistry) [11:44:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [11:45:50] (03Merged) 10jenkins-bot: api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [11:46:05] (03Merged) 10jenkins-bot: machinetranslation: Use s3 for model download in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) (owner: 10KartikMistry) [11:46:19] (03CR) 10Hashar: gerrit: avoid hardcoded hostnames, replace with hiera lookups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [11:47:05] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6195/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167200 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [11:49:08] !log restarting exim on 
Phabricator nodes to pick up OpenSSL updates [11:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [11:49:59] (03CR) 10Brouberol: [C:03+1] Add the new cephosd200[1-3] servers in codfw to their role [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:51:03] (03CR) 10Hashar: gerrit: config replicas for rename-project plugin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [11:51:17] (03PS4) 10Hashar: gerrit: config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) [11:52:30] !log restarting FPM on Phabricator nodes to pick up OpenSSL updates [11:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:33] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [11:52:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:52:53] (03PS1) 10Vgutierrez: cache::haproxy: Replace res.hdr() with res.fhdr() [puppet] - 10https://gerrit.wikimedia.org/r/1167203 (https://phabricator.wikimedia.org/T397917) [11:52:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:53:21] (03CR) 10Btullis: [V:03+1 C:03+2] Add the new cephosd200[1-3] servers in codfw to their role [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:54:26] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cephosd[2001-2003].codfw.wmnet with reason: Bootstrapping new ceph cluster [11:54:52] (03PS2) 10Vgutierrez: 
cache::haproxy: Replace hdr() with fhdr() [puppet] - 10https://gerrit.wikimedia.org/r/1167203 (https://phabricator.wikimedia.org/T397917) [11:56:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167203 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [11:57:33] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [11:59:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [11:59:54] (03CR) 10FNegri: [C:03+1] "Let's keep it around for now, it might be useful for T381587." [puppet] - 10https://gerrit.wikimedia.org/r/989542 (owner: 10Majavah) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1200) [12:00:08] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw [12:00:31] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:00:42] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:00:59] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM; I would've fixed only X-analytics in this patch, to reduce the risk, but proceed as you prefer." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167203 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [12:01:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw [12:01:49] (03PS3) 10Majavah: P:wmcs::db: querysampler: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/989542 [12:02:25] (03CR) 10FNegri: [C:03+1] P:wmcs::db: querysampler: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/989542 (owner: 10Majavah) [12:02:42] !log jmm@cumin1002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-eqiad [12:03:04] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad [12:03:55] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Replace hdr() with fhdr() [puppet] - 10https://gerrit.wikimedia.org/r/1167203 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [12:04:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad [12:06:05] (03CR) 10Majavah: [C:03+2] P:wmcs::db: querysampler: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/989542 (owner: 10Majavah) [12:06:08] 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253#10983399 (10Ladsgroup) The above patch needs reworking to take advantage of tables cat... 
[12:06:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:06:34] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:06:46] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:06:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:08:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:09:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:10:16] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [12:10:35] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:10:39] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3414 MB (3% inode=98%): /tmp 3414 MB (3% inode=98%): /var/tmp 3414 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [12:11:13] (03CR) 10Jcrespo: "Are you aware of the details of db-compare? 
In addition to ids, sometimes an --order-by is needed as it may produce false positives due to" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [12:12:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [12:12:41] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [12:20:01] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cephosd2001.codfw.wmnet with OS bookworm [12:20:15] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host cephosd2001 [12:20:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bookworm [12:20:58] 06SRE, 10SRE-Access-Requests: Requesting access to LogStash for DSantamaria (IDP) - https://phabricator.wikimedia.org/T398956 (10DSantamaria) 03NEW [12:21:02] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host cephosd2002 [12:21:22] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cephosd2003.codfw.wmnet with OS bookworm [12:21:28] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:21:37] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host cephosd2003 [12:21:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:49] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cephosd2001 - btullis@cumin1003" [12:24:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cephosd2001 - btullis@cumin1003" [12:24:53] !log 
btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:53] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache cephosd2001.codfw.wmnet 133.0.192.10.in-addr.arpa 3.3.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:24:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cephosd2001.codfw.wmnet 133.0.192.10.in-addr.arpa 3.3.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:24:57] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cephosd2001 [12:25:04] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:25:39] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet [12:26:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [12:26:06] 06SRE, 10SRE-Access-Requests: Requesting access to LogStash for DSantamaria (IDP) - https://phabricator.wikimedia.org/T398956#10983504 (10MoritzMuehlenhoff) Access to Logstash is handled via Wikimedia IDM, please see https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access for details [12:26:13] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:26:24] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:26:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cephosd2001 [12:26:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cephosd2001 [12:27:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:27:41] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache cephosd2003.codfw.wmnet 240.48.192.10.in-addr.arpa 0.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:27:44] !log btullis@cumin1003 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) cephosd2003.codfw.wmnet 240.48.192.10.in-addr.arpa 0.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:27:44] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cephosd2003 [12:28:11] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:28:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cephosd2003 [12:28:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cephosd2003 [12:30:42] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet [12:30:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:30:46] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache cephosd2002.codfw.wmnet 235.32.192.10.in-addr.arpa 5.3.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:30:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cephosd2002.codfw.wmnet 235.32.192.10.in-addr.arpa 5.3.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:30:50] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cephosd2002 [12:31:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [12:31:34] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet [12:32:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be2004.codfw.wmnet [12:32:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cephosd2002 [12:32:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cephosd2002 [12:33:10] (03PS1) 10Btullis: Add the new dse-k8s hosts 
to site.pp so that we can create the VMs [puppet] - 10https://gerrit.wikimedia.org/r/1167209 (https://phabricator.wikimedia.org/T397293) [12:34:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167209 (https://phabricator.wikimedia.org/T397293) (owner: 10Btullis) [12:36:55] (03CR) 10Btullis: [C:03+2] Add the new dse-k8s hosts to site.pp so that we can create the VMs [puppet] - 10https://gerrit.wikimedia.org/r/1167209 (https://phabricator.wikimedia.org/T397293) (owner: 10Btullis) [12:38:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be2004.codfw.wmnet [12:38:10] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet [12:39:43] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host apus-be1004.eqiad.wmnet [12:40:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [12:40:38] !log installing commons-beanutils security updates [12:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:46] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1004.eqiad.wmnet [12:44:41] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet [12:45:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [12:46:44] !log btullis@cumin1003 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd2001.codfw.wmnet [12:46:45] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:46:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet [12:48:47] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd2001.codfw.wmnet with reason: host reimage [12:49:50] !log jgiannelos@deploy1003 helmfile [staging] START 
helmfile.d/services/mobileapps: apply [12:49:54] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet [12:49:56] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:50:04] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:50:43] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd2003.codfw.wmnet with reason: host reimage [12:51:22] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:51:32] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:51:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet [12:52:26] btullis@cumin1003 makevm (PID 816983) is awaiting input [12:52:29] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:52:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd2001.codfw.wmnet with reason: host reimage [12:54:28] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd2002.codfw.wmnet with reason: host reimage [12:54:37] (03PS2) 10Jcrespo: dbbackups: Upgrade dbprov1005 & dbprov2005 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167185 (https://phabricator.wikimedia.org/T394487) [12:54:44] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-etcd2001.codfw.wmnet - btullis@cumin1003" [12:54:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-etcd2001.codfw.wmnet - btullis@cumin1003" [12:54:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [12:54:49] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd2001.codfw.wmnet on all recursors [12:54:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd2001.codfw.wmnet on all recursors [12:55:16] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-etcd2001.codfw.wmnet - btullis@cumin1003" [12:55:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-etcd2001.codfw.wmnet - btullis@cumin1003" [12:56:27] !log installing ICU security updates [12:56:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd2003.codfw.wmnet with reason: host reimage [12:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:39] !log installing ICU security updates on Bookworm [12:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:30] PROBLEM - Host cephosd2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:21] btullis@cumin1003 makevm (PID 816983) is awaiting input [12:59:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd2002.codfw.wmnet with reason: host reimage [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:02:33] RECOVERY - Host cephosd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [13:04:38] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbprov2005.codfw.wmnet,dbprov1005.eqiad.wmnet with reason: MariaDB package update [13:07:29] (03PS1) 10CDobbins: varnish: selectively increase NetworkProbeLimit [puppet] - 10https://gerrit.wikimedia.org/r/1167215 [13:07:44] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade dbprov1005 & dbprov2005 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167185 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [13:10:48] (03PS1) 10Giuseppe Lavagetto: Roll-forward again [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167216 [13:11:09] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Roll-forward again [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167216 (owner: 10Giuseppe Lavagetto) [13:12:21] (03PS1) 10Ladsgroup: Set purge values for parsercache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167217 (https://phabricator.wikimedia.org/T398806) [13:14:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-etcd2001.codfw.wmnet with OS bookworm [13:14:56] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Revert logging changes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167079 (owner: 10Giuseppe Lavagetto) [13:17:10] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Do not log rate-limiting rules if it wouldn\'t be applied - oblivian@cumin1003" [13:17:11] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Do not log rate-limiting rules if it wouldn\'t be applied - oblivian@cumin1003 [13:17:47] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Do 
not log rate-limiting rules if it wouldn\'t be applied - oblivian@cumin1003 [13:17:48] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Do not log rate-limiting rules if it wouldn\'t be applied - oblivian@cumin1003" [13:18:03] !log restarting Postfix on mx* and crm2001 to pick up ICU security updates [13:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:31] !log restart clamav on VRTS to pick up ICU security updates [13:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:43] !log cmooney@cumin2002 START - Cookbook sre.dns.netbox [13:23:41] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:23:58] !log cmooney@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new ML mega-hosts in eqiad - cmooney@cumin2002" [13:24:31] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:25:11] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10983753 (10MoritzMuehlenhoff) [13:26:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:26:22] (03CR) 10Ssingh: [C:03+1] varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) (owner: 10Vgutierrez) [13:26:39] RESOLVED: [4x] CoreBGPDown: Core BGP session 
down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:27:03] cmooney@cumin2002 netbox (PID 2989694) is awaiting input [13:27:56] !log cmooney@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new ML mega-hosts in eqiad - cmooney@cumin2002" [13:27:57] !log cmooney@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:58] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921#10983772 (10Arendpieter) [13:29:01] !llog installing rsync security updates [13:30:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10983794 (10cmooney) [13:32:13] (03Abandoned) 10Vgutierrez: Revert "cache,haproxy: Remove http response captures" [puppet] - 10https://gerrit.wikimedia.org/r/1167200 (https://phabricator.wikimedia.org/T397917) (owner: 10Vgutierrez) [13:32:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10983806 (10cmooney) Updated task description there. I ran the Provision a server Netbox script for ml-serve1012, ml-serve1013 and ml-serve1014 just now, as well as t... 
[13:32:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10983807 (10Jclark-ctr) @cmooney thanks for assisting with network ports [13:34:19] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#10983824 (10Jdrewniak) [13:35:01] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#10983830 (10Jdrewniak) [13:37:46] (03PS1) 10Ssingh: cookbook: add sre.cdn.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 [13:38:07] (03CR) 10Vgutierrez: [C:03+2] varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) (owner: 10Vgutierrez) [13:40:14] 06SRE, 10SRE-Access-Requests: Requesting access to LogStash for DSantamaria (IDP) - https://phabricator.wikimedia.org/T398956#10983916 (10DSantamaria) 05Open→03Resolved a:03DSantamaria Thanks! 
[13:40:44] !log installing werkzeug security updates [13:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:46] (03CR) 10Vgutierrez: [C:03+1] "tested with ` test-cookbook -d -c 1167222 sre.cdn.roll-restart-haproxy --alias cp-ulsfo_upload --reason 'OpenSSL update'` and `test-cookbo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 (owner: 10Ssingh) [13:45:41] (03CR) 10Ssingh: [V:03+2 C:03+2] cookbook: add sre.cdn.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 (owner: 10Ssingh) [13:47:41] (03CR) 10Arnaudb: [C:03+1] gerrit: avoid hardcoded hostnames, replace with hiera lookups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [13:48:59] (03CR) 10Ssingh: [V:03+2 C:03+2] "The +2 was not intentional and a mistake on my part. CI has now finished running." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 (owner: 10Ssingh) [13:49:31] (03CR) 10Ssingh: [V:03+1 C:03+2] cookbook: add sre.cdn.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 (owner: 10Ssingh) [13:49:37] (03CR) 10Ssingh: [V:03+1 C:03+2] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 (owner: 10Ssingh) [13:50:38] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [13:50:40] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [13:52:37] (03Merged) 10jenkins-bot: cookbook: add sre.cdn.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1167222 (owner: 10Ssingh) [13:53:14] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2002) [13:53:18] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.topology-check (exit_code=0) Validate Gerrit topology 
(source=gerrit1003, replica=gerrit2002) [13:53:25] PROBLEM - Confd vcl based reload on cp4043 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:53:52] ^ vgutierrez, is this you and the recent change? [13:53:56] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2222.mgmt:22 - https://phabricator.wikimedia.org/T398577#10984003 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm pings. D3 breaker reset [13:54:10] hmmm no [13:54:17] interesting [13:54:19] that sounds like _joe_/fabfur [13:54:20] let's look [13:54:21] but let me check [13:54:30] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for restbase2035.mgmt:22 - https://phabricator.wikimedia.org/T398576#10984009 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [13:54:48] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2214.mgmt:22 - https://phabricator.wikimedia.org/T398575#10984015 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker rest. pings. [13:55:09] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2227.mgmt:22 - https://phabricator.wikimedia.org/T398574#10984023 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings [13:55:09] oh.. reload vcl [13:55:10] that's me [13:55:31] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2225.mgmt:22 - https://phabricator.wikimedia.org/T398566#10984029 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings [13:55:44] (03PS1) 10Xcollazo: analytics: Absent rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) [13:55:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2223.mgmt:22 - https://phabricator.wikimedia.org/T398558#10984036 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. 
[13:56:43] sukhe: I might need some coffee but I don't see the problem on cp4043 [13:56:48] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for restbase2038.mgmt:22 - https://phabricator.wikimedia.org/T398555#10984045 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. [13:58:21] <_joe_> sukhe: confd seems to think everything's fine by itself [13:58:28] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2218.mgmt:22 - https://phabricator.wikimedia.org/T398554#10984054 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. [13:58:42] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for maps2008.mgmt:22 - https://phabricator.wikimedia.org/T398553#10984058 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker rest. pings. [13:58:49] where is that alert coming from? [13:59:03] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc-misc2002.mgmt:22 - https://phabricator.wikimedia.org/T398552#10984062 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [13:59:25] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2226.mgmt:22 - https://phabricator.wikimedia.org/T398551#10984066 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. [13:59:29] that's confd_resource_healthy... [13:59:44] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2193.mgmt:22 - https://phabricator.wikimedia.org/T398550#10984072 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. 
[13:59:57] !incidents [13:59:58] 6459 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (eventgate-analytics-external.discovery.wmnet) [13:59:58] 6458 (RESOLVED) ATSBackendErrorsHigh cache_text sre (eventgate-analytics-external.discovery.wmnet eqsin) [13:59:58] 6457 (RESOLVED) ATSBackendErrorsHigh cache_text sre (eventgate-analytics-external.discovery.wmnet eqsin) [14:00:04] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2319.mgmt:22 - https://phabricator.wikimedia.org/T398549#10984079 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. [14:00:23] `confd_vcl_reload_success 0` [14:00:40] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2216.mgmt:22 - https://phabricator.wikimedia.org/T398548#10984083 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. [14:00:51] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2192.mgmt:22 - https://phabricator.wikimedia.org/T398547#10984090 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [14:00:52] (03CR) 10JHathaway: [C:03+1] tox.ini: skip Python 3.10 in CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167081 (owner: 10Volans) [14:01:05] so that comes from confd-reload-vcl.sh [14:01:13] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for conf2006.mgmt:22 - https://phabricator.wikimedia.org/T398546#10984094 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [14:01:17] _joe_: latest requestctl commit triggered that error for some reason? [14:01:43] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2220.mgmt:22 - https://phabricator.wikimedia.org/T398545#10984099 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. [14:02:20] crap... 
meeting :D [14:02:30] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for arclamp2001.mgmt:22 - https://phabricator.wikimedia.org/T398543#10984117 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [14:02:33] same :| [14:02:34] will be back in 15 [14:02:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for gerrit2002.mgmt:22 - https://phabricator.wikimedia.org/T398542#10984121 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [14:03:25] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cirrussearch2115.mgmt:22 - https://phabricator.wikimedia.org/T398541#10984128 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker rest. pings. [14:03:50] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2200.mgmt:22 - https://phabricator.wikimedia.org/T398540#10984132 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. [14:04:07] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2152.mgmt:22 - https://phabricator.wikimedia.org/T398539#10984136 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [14:04:21] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for thanos-fe2007.mgmt:22 - https://phabricator.wikimedia.org/T398538#10984140 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [14:05:10] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on db2213 - https://phabricator.wikimedia.org/T398537#10984148 (10Jhancock.wm) 05Open→03Resolved breaker reset. alert cleared. [14:05:31] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2318.mgmt:22 - https://phabricator.wikimedia.org/T398536#10984162 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. pings. 
[14:11:08] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2320:9290 - https://phabricator.wikimedia.org/T398514#10984187 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm D3 breaker reset. alert cleared. [14:18:11] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-etcd2001.codfw.wmnet with reason: host reimage [14:21:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-etcd2001.codfw.wmnet with reason: host reimage [14:22:29] (03PS1) 10Vgutierrez: cdn.roll-upgrade-haproxy: Run puppet and then restart haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1167229 [14:22:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10984249 (10Marostegui) [14:23:53] !log btullis@cumin1003 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd2002.codfw.wmnet [14:23:54] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:24:57] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2019 / ganeti2020 - https://phabricator.wikimedia.org/T398671#10984256 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:25:16] (03PS1) 10Marostegui: db1185: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167231 (https://phabricator.wikimedia.org/T398928) [14:25:48] (03CR) 10Marostegui: [C:03+2] db1185: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167231 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [14:26:08] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:26:26] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:26:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:32] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:26:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1185 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78811 and previous config saved to /var/cache/conftool/dbconfig/20250708-142635-marostegui.json [14:26:55] (03PS1) 10Arnaudb: gerrit: standardize expected rc on systemctl check [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) [14:26:55] (03CR) 10Arnaudb: "You can test this patch with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [14:27:46] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd2003-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397979#10984285 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:28:41] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1430) [14:30:08] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1167233 [14:31:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:31:41] btullis@cumin1003 makevm (PID 834139) is awaiting input [14:34:23] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1185 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78812 and previous config saved to /var/cache/conftool/dbconfig/20250708-143422-root.json [14:34:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:36:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:36:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd2001.codfw.wmnet with OS bookworm [14:36:35] (03PS3) 10Hnowlan: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [14:39:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-etcd2001.codfw.wmnet with OS bookworm [14:39:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd2001.codfw.wmnet [14:41:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:30] !log installing shadow security updates [14:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:38] 
(03CR) 10Ssingh: [C:03+1] "Fair enough!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167229 (owner: 10Vgutierrez) [14:41:52] (03CR) 10Vgutierrez: [C:03+2] cdn.roll-upgrade-haproxy: Run puppet and then restart haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1167229 (owner: 10Vgutierrez) [14:44:14] (03CR) 10Hnowlan: [C:03+2] Add fake hcaptcha proxy secrets. [labs/private] - 10https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [14:44:22] (03CR) 10Hnowlan: [V:03+2 C:03+2] Add fake hcaptcha proxy secrets. [labs/private] - 10https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [14:45:44] !log btullis@cumin1003 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd2003.codfw.wmnet [14:45:46] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:45:48] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [14:46:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:29] (03CR) 10Hnowlan: [C:03+1] "Thank you!" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1167233 (owner: 10Muehlenhoff) [14:46:31] RECOVERY - Confd vcl based reload on cp4043 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [14:46:38] sukhe: ^^ [14:47:01] :D [14:47:23] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-eqsin and not P{cp[5017,5025].eqsin.wmnet} and A:cp - 2.8.15 upgrade (T398720) [14:47:26] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [14:48:20] (03CR) 10Herron: "Yes, its based on the typical run time and with understanding that sometimes the agent may hit the lock and wait another 5m. Aiming for a" [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: 10Herron) [14:49:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78813 and previous config saved to /var/cache/conftool/dbconfig/20250708-144928-root.json [14:49:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:49:59] !log Ran fixStuckGlobalRename.php for T398837 [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:01] T398837: Unblock stuck global rename of Princekng1425 - https://phabricator.wikimedia.org/T398837 [14:50:19] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-etcd2003.codfw.wmnet - btullis@cumin1003" [14:50:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-etcd2003.codfw.wmnet - btullis@cumin1003" [14:50:23] !log btullis@cumin1003 END 
(PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:23] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd2003.codfw.wmnet on all recursors [14:50:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd2003.codfw.wmnet on all recursors [14:50:30] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:50:47] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-etcd2003.codfw.wmnet - btullis@cumin1003" [14:50:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-etcd2003.codfw.wmnet - btullis@cumin1003" [14:51:00] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1167233 (owner: 10Muehlenhoff) [14:51:15] (03CR) 10Fabfur: [C:03+1] "so long mwdebug!" [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [14:51:30] (03CR) 10JHathaway: [C:03+1] "looks good, the empty string as a signal to create a noop request feels a little odd, but I don't have a great alternative suggestion." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1154787 (owner: 10Volans) [14:52:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-etcd2003.codfw.wmnet with OS bookworm [14:53:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:06] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd2002.codfw.wmnet on all recursors [14:53:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd2002.codfw.wmnet on all recursors [14:53:32] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-etcd2002.codfw.wmnet - btullis@cumin1003" [14:53:34] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-magru [14:55:49] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-ulsfo [14:56:37] btullis@cumin1003 makevm (PID 834139) is awaiting input [14:57:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-etcd2002.codfw.wmnet - btullis@cumin1003" [14:58:00] (03CR) 10Volans: [C:03+2] "Ack, thanks. Yes that's what we came out with Luca in wmflib as a way to allow a more ease use on the spicerack side, but is behind a flag" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1154787 (owner: 10Volans) [14:58:05] (03CR) 10Volans: [C:03+2] tox.ini: skip Python 3.10 in CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167081 (owner: 10Volans) [15:00:05] jelto, arnoldokoth, and mutante: gettimeofday() says it's time for SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1500) [15:00:34] btullis@cumin1003 makevm (PID 834139) is awaiting input [15:00:58] (03Abandoned) 10JHathaway: Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163801 (owner: 10JHathaway) [15:02:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd2002.codfw.wmnet with OS bookworm [15:04:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78814 and previous config saved to /var/cache/conftool/dbconfig/20250708-150434-root.json [15:05:56] (03CR) 10Herron: [C:03+2] pyrra-filesystem: clear output files on service start [puppet] - 10https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:09:59] (03Merged) 10jenkins-bot: cookbook API: simplify -t/--task-id support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1154787 (owner: 10Volans) [15:10:00] (03Merged) 10jenkins-bot: tox.ini: skip Python 3.10 in CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167081 (owner: 10Volans) [15:11:19] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167240 [15:11:46] gonna deploy a JsonConfig fix for Charts -- 1166942 [15:11:48] (03CR) 10Filippo Giunchedi: [C:03+1] "Ok thank you for the explanation" [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: 10Herron) [15:12:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-etcd2002.codfw.wmnet with OS bookworm [15:12:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166942 (https://phabricator.wikimedia.org/T398597) (owner: 10Bvibber) [15:12:53] (03CR) 10Hnowlan: 
[C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167240 (owner: 10Muehlenhoff) [15:13:45] (03PS1) 10Zabe: Enable categorylinks read new on a few large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167241 (https://phabricator.wikimedia.org/T397912) [15:18:08] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-etcd2003.codfw.wmnet with reason: host reimage [15:18:14] (03PS1) 10Tchanders: Revert "UserLinker: remove back compat with old arguments of UserLinkRenderer" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 [15:18:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-magru [15:19:25] (03CR) 10Muehlenhoff: [C:03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: 10Herron) [15:19:40] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-esams [15:19:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78815 and previous config saved to /var/cache/conftool/dbconfig/20250708-151939-root.json [15:20:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-ulsfo [15:21:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-etcd2003.codfw.wmnet with reason: host reimage [15:22:06] (03Merged) 10jenkins-bot: Support null values in data columns in transform output [extensions/JsonConfig] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166942 (https://phabricator.wikimedia.org/T398597) (owner: 10Bvibber) [15:22:34] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1166942|Support null values in data columns in transform output (T398597)]] 
[15:22:39] T398597: Transformed .chart pages crash when the underlying .tab page contains null values - https://phabricator.wikimedia.org/T398597 [15:22:47] (03CR) 10Herron: [C:03+2] alerting_host: set puppet agent to 5m interval [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: 10Herron) [15:24:40] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1166942|Support null values in data columns in transform output (T398597)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:25:48] !log bvibber@deploy1003 bvibber: Continuing with sync [15:26:16] (03PS1) 10Tchanders: Revert "Add user-related link colors to LinkRenderer::getLinkClasses" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167244 (https://phabricator.wikimedia.org/T392775) [15:27:58] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-etcd2002.codfw.wmnet with reason: host reimage [15:28:52] (03CR) 10CI reject: [V:04-1] Revert "UserLinker: remove back compat with old arguments of UserLinkRenderer" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 (owner: 10Tchanders) [15:30:02] (03PS11) 10Volans: git::clone: remove remote_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [15:30:10] (03CR) 10Hashar: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [15:31:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-etcd2002.codfw.wmnet with reason: host reimage [15:31:26] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166942|Support null values in data columns in transform output (T398597)]] (duration: 08m 52s) [15:31:30] T398597: Transformed .chart pages crash when the underlying .tab page contains null values - https://phabricator.wikimedia.org/T398597 [15:31:36] \o/ 
done [15:32:00] uh oh. whither wmopbot [15:38:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-etcd2003.codfw.wmnet with OS bookworm [15:38:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd2003.codfw.wmnet [15:42:11] (03PS2) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [15:42:12] (03CR) 10Federico Ceratto: "Tests are passing. It's a simple cookbook but some care could be needed with handling failure modes around dbctl" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [15:43:01] (03PS1) 10Volans: administrative: add support for empty task ID [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167247 [15:44:11] (03CR) 10JHathaway: [C:03+1] administrative: add support for empty task ID [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167247 (owner: 10Volans) [15:44:24] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-esams [15:48:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-etcd2002.codfw.wmnet with OS bookworm [15:48:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd2002.codfw.wmnet [15:49:44] (03PS1) 10Hnowlan: changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) [15:51:46] (03CR) 10CI reject: [V:04-1] administrative: add support for empty task ID [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167247 (owner: 10Volans) [15:52:39] 10SRE-SLO, 13Patch-For-Review: Reduce the pyrra's multi-dc configurations where it makes sense - 
https://phabricator.wikimedia.org/T398534#10984766 (10herron) Today I reviewed a sampling of our published SLO docs and while some do make mention of 'datacenter' and specific names like 'eqiad' 'codfw', I didn't s... [15:52:59] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167247 (owner: 10Volans) [15:57:34] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-drmrs [16:00:05] jhathaway and moritzm: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1600) [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:22] already deployed [16:00:27] Indeed. Thanks Mortiz! [16:01:06] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10984815 (10MoritzMuehlenhoff) [16:02:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:06:59] !log btullis@cumin1003 START - Cookbook sre.ganeti.makevm for new host dse-k8s-ctrl2001.codfw.wmnet [16:07:01] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:07:26] !log btullis@cumin1003 START - Cookbook sre.ganeti.makevm for new host dse-k8s-ctrl2002.codfw.wmnet [16:07:54] (03CR) 10Volans: [C:03+2] administrative: add support for empty task ID [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167247 (owner: 10Volans) [16:10:57] jouncebot: nowandnext [16:10:57] For the next 0 hour(s) and 49 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1600) [16:10:58] In 0 hour(s) and 49 minute(s): 
MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1700) [16:11:21] (03PS4) 10Pppery: Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) [16:11:40] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm [16:12:03] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:12:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167244 (https://phabricator.wikimedia.org/T392775) (owner: 10Tchanders) [16:12:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 (owner: 10Tchanders) [16:12:35] btullis@cumin1003 makevm (PID 848937) is awaiting input [16:14:11] (03PS1) 10Btullis: Update the IP addresses for cephosd200[1-3] post vlan-move [puppet] - 10https://gerrit.wikimedia.org/r/1167254 (https://phabricator.wikimedia.org/T374923) [16:14:21] (03PS1) 10Máté Szabó: UpdateMessageJobTest: Read expected transver from latest [extensions/Translate] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167256 (https://phabricator.wikimedia.org/T398904) [16:14:44] (03PS2) 10Máté Szabó: Revert "UserLinker: remove back compat with old arguments of UserLinkRenderer" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 (owner: 10Tchanders) [16:14:52] (03CR) 10CI reject: [V:04-1] Revert "Add user-related link colors to LinkRenderer::getLinkClasses" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167244 (https://phabricator.wikimedia.org/T392775) (owner: 10Tchanders) [16:15:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/1167244 (https://phabricator.wikimedia.org/T392775) (owner: 10Tchanders) [16:15:07] (03CR) 10TrainBranchBot: "Approved by mszabo@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 (owner: 10Tchanders) [16:15:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/Translate] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167256 (https://phabricator.wikimedia.org/T398904) (owner: 10Máté Szabó) [16:15:15] (03Merged) 10jenkins-bot: administrative: add support for empty task ID [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167247 (owner: 10Volans) [16:17:39] btullis@cumin1003 makevm (PID 848966) is awaiting input [16:17:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:18:52] (03CR) 10Máté Szabó: "recheck" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 (owner: 10Tchanders) [16:20:07] (03PS1) 10Krinkle: beta: Remove beta-specific 'http' entry for wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167258 [16:20:07] (03PS1) 10Krinkle: beta: Move beta wikipedia canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167259 (https://phabricator.wikimedia.org/T289318) [16:21:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-eqsin and not P{cp[5017,5025].eqsin.wmnet} and A:cp - 2.8.15 upgrade (T398720) [16:21:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:42] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [16:22:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-drmrs [16:23:26] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-eqiad [16:23:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd2003.codfw.wmnet with OS bookworm [16:27:04] !log Add ATS routing to profile::trafficserver::backend::mapping_rules in Hiera (Horizon pupet prefix: cache-text) for a wmcloud version of config-master.wikimedia.beta.wmflabs.org [16:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] (03Merged) 10jenkins-bot: UpdateMessageJobTest: Read expected transver from latest [extensions/Translate] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167256 (https://phabricator.wikimedia.org/T398904) (owner: 10Máté Szabó) [16:28:31] (03CR) 10Btullis: [C:03+2] Update the IP addresses for cephosd200[1-3] post vlan-move [puppet] - 10https://gerrit.wikimedia.org/r/1167254 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [16:28:48] (03Merged) 10jenkins-bot: Revert "UserLinker: remove back compat with old arguments of UserLinkRenderer" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167243 (owner: 10Tchanders) [16:28:52] (03Merged) 10jenkins-bot: Revert "Add user-related link colors to LinkRenderer::getLinkClasses" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167244 (https://phabricator.wikimedia.org/T392775) (owner: 10Tchanders) [16:30:27] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1167244|Revert "Add user-related link colors to LinkRenderer::getLinkClasses" (T392775 T398714 T398717 T398952)]], [[gerrit:1167243|Revert "UserLinker: remove back compat with old 
arguments of UserLinkRenderer"]], [[gerrit:1167256|UpdateMessageJobTest: Read expected transver from latest (T398904)]] [16:31:05] T392775: Add link color for temporary usernames - https://phabricator.wikimedia.org/T392775 [16:31:05] T398714: "Show IP" appearing twice - https://phabricator.wikimedia.org/T398714 [16:31:06] T398717: IPInfo button only appears for the first temporary account - https://phabricator.wikimedia.org/T398717 [16:31:06] T398952: Inconsistent/confusing styles for temporary account links - https://phabricator.wikimedia.org/T398952 [16:31:07] T398904: UpdateMessageJobTest failing as of 2025-07-07 - https://phabricator.wikimedia.org/T398904 [16:31:54] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [16:32:33] !log mszabo@deploy1003 tchanders, mszabo: Backport for [[gerrit:1167244|Revert "Add user-related link colors to LinkRenderer::getLinkClasses" (T392775 T398714 T398717 T398952)]], [[gerrit:1167243|Revert "UserLinker: remove back compat with old arguments of UserLinkRenderer"]], [[gerrit:1167256|UpdateMessageJobTest: Read expected transver from latest (T398904)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:33:51] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudnet2006-dev.codfw.wmnet with OS bookworm [16:34:12] !log mszabo@deploy1003 tchanders, mszabo: Continuing with sync [16:35:16] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [16:36:30] !log cdanis@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "feat: reverse deps - cdanis@cumin1002" [16:36:31] !log cdanis@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: reverse deps - cdanis@cumin1002 [16:37:00] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: reverse deps - cdanis@cumin1002 [16:37:01] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "feat: reverse deps - cdanis@cumin1002" [16:39:38] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167244|Revert "Add user-related link colors to LinkRenderer::getLinkClasses" (T392775 T398714 T398717 T398952)]], [[gerrit:1167243|Revert "UserLinker: remove back compat with old arguments of UserLinkRenderer"]], [[gerrit:1167256|UpdateMessageJobTest: Read expected transver from latest (T398904)]] (duration: 09m 10s) [16:39:50] T392775: Add link color for temporary usernames - https://phabricator.wikimedia.org/T392775 [16:39:52] T398714: "Show IP" appearing twice - https://phabricator.wikimedia.org/T398714 [16:39:52] T398717: IPInfo button only appears for the first temporary account - https://phabricator.wikimedia.org/T398717 [16:39:52] T398952: Inconsistent/confusing styles for temporary account links - https://phabricator.wikimedia.org/T398952 [16:39:52] T398904: UpdateMessageJobTest failing as of 2025-07-07 - 
https://phabricator.wikimedia.org/T398904 [16:40:39] !log dancy@deploy1003 Installing scap version "4.187.0" for 2 host(s) [16:40:59] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [16:41:02] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [16:42:07] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#10985142 (10Jclark-ctr) @BTullis I am having issues with this server after Hard drive replacements it will not rebuild VD I did not want to clear cache it says it could cause data loss ` STOR305: Una... [16:42:29] !log dancy@deploy1003 Installation of scap version "4.187.0" completed for 2 hosts [16:42:41] (03PS1) 10CDanis: feat: Reverse Depends [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167261 [16:42:55] I always forget I need to mess with the deploy repo [16:43:09] (03CR) 10CDanis: [V:03+2 C:03+2] feat: Reverse Depends [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1167261 (owner: 10CDanis) [16:43:21] !log cdanis@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "feat: reverse deps - cdanis@cumin1002" [16:43:23] !log cdanis@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: reverse deps - cdanis@cumin1002 [16:43:51] (03PS1) 10Cwhite: logstash: nest_root_fields.rb fix memory leak and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1167262 (https://phabricator.wikimedia.org/T398990) [16:43:53] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: reverse deps - cdanis@cumin1002 [16:43:54] !log cdanis@cumin1002 END (PASS) -
Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "feat: reverse deps - cdanis@cumin1002" [16:45:00] (03PS2) 10Krinkle: beta: Move beta wikipedia canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167259 (https://phabricator.wikimedia.org/T289318) [16:46:11] (03CR) 10CI reject: [V:04-1] logstash: nest_root_fields.rb fix memory leak and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1167262 (https://phabricator.wikimedia.org/T398990) (owner: 10Cwhite) [16:48:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-eqiad [16:50:05] (03PS2) 10Cwhite: logstash: nest_root_fields.rb fix memory leak and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1167262 (https://phabricator.wikimedia.org/T398990) [16:52:13] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [16:53:39] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm [16:56:32] (03PS3) 10Cwhite: logstash: nest_root_fields.rb fix memory leak and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1167262 (https://phabricator.wikimedia.org/T398990) [16:58:36] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [16:58:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:59:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1700) [17:02:52] !log jclark@cumin1002 START - Cookbook 
sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:02:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:04:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:04:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:06:28] (03CR) 10Herron: [C:03+1] "Thanks for the refactor, looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [17:09:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10985333 (10Jclark-ctr) @klausman Will this be legacy or uefi? it is reachable @elukey The first machine learning server is cabled ml-serve1012 The provisioning scr... 
[17:10:33] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [17:10:36] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [17:10:37] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2076 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:10:39] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:10:51] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-ctrl2001.codfw.wmnet - btullis@cumin1003" [17:11:34] (03CR) 10Cwhite: [C:03+2] logstash: nest_root_fields.rb fix memory leak and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1167262 (https://phabricator.wikimedia.org/T398990) (owner: 10Cwhite) [17:13:22] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [17:13:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-ctrl2001.codfw.wmnet - btullis@cumin1003" [17:13:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:26] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl2001.codfw.wmnet on all recursors [17:13:29] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cirrussearch2076:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:30] !log btullis@cumin1003 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl2001.codfw.wmnet on all recursors [17:13:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:13:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:13:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-ctrl2001.codfw.wmnet - btullis@cumin1003" [17:13:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-ctrl2001.codfw.wmnet - btullis@cumin1003" [17:14:18] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-ctrl2001.codfw.wmnet with OS bookworm [17:14:31] (03PS4) 10Ssingh: team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) [17:18:27] (03CR) 10Ssingh: "The label_replace mangling is intentional here so as to avoid making changes to the existing setup, both for anycast-hc and how we generat" [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [17:18:34] (03CR) 10Ssingh: "(Ready for review)" [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [17:18:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dse-k8s-ctrl2002.codfw.wmnet - btullis@cumin1003" [17:18:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for 
VM dse-k8s-ctrl2002.codfw.wmnet - btullis@cumin1003" [17:18:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:55] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl2002.codfw.wmnet on all recursors [17:18:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl2002.codfw.wmnet on all recursors [17:19:21] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-ctrl2002.codfw.wmnet - btullis@cumin1003" [17:21:07] 10ops-codfw, 06SRE, 06DC-Ops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10985424 (10cmooney) 05Resolved→03Open @Jhancock.wm I notice these devices are still in Netbox? https://netbox.wikimedia.org/dcim/devices/?manufacturer_id=96 Not sure... [17:21:21] (03PS1) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [17:21:41] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2006-dev.codfw.wmnet with OS bookworm [17:22:26] btullis@cumin1003 makevm (PID 848966) is awaiting input [17:22:38] (03PS2) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [17:22:44] (03PS3) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [17:22:45] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [17:25:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dse-k8s-ctrl2002.codfw.wmnet - btullis@cumin1003" [17:25:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-ctrl2002.codfw.wmnet with OS bookworm [17:28:29] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cirrussearch2076:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:24] (03CR) 10Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [17:29:33] (03PS4) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [17:30:01] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bookworm [17:30:37] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2076 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:33:05] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl2001.codfw.wmnet with reason: host reimage [17:36:23] (03PS1) 10Krinkle: beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) [17:39:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl2001.codfw.wmnet with reason: host reimage [17:39:37] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10985509 (10cmooney) @Jhancock.wm in terms 
of the new Nokia switches in the expansion cage we can cable them to the spines like this: |Spine|Spine Port|Leaf|Leaf Port| |------|-------------|... [17:43:17] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl2002.codfw.wmnet with reason: host reimage [17:48:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl2002.codfw.wmnet with reason: host reimage [17:50:16] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [17:52:46] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@5c0689d]: sync rdf-spark-tools 0.3.158 artifacts [17:53:06] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@5c0689d]: sync rdf-spark-tools 0.3.158 artifacts (duration: 00m 19s) [17:53:31] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [17:53:43] (03PS7) 10Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) [17:55:09] (03PS4) 10Krinkle: beta: Document beta-specific "w.beta.wmcloud.org" handling [puppet] - 10https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) [17:55:14] (03PS5) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [17:55:55] (03PS6) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [17:55:55] (03PS2) 10Krinkle: beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) 
[17:58:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-ctrl2001.codfw.wmnet with OS bookworm [17:58:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-ctrl2001.codfw.wmnet [18:00:18] (03CR) 10Andrea Denisse: "I just left a small question, otherwise LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [18:04:33] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10985582 (10Dzahn) Since yesterday we are now replicating to the new machine gerrit2003 again. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153265 [18:05:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [18:05:47] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10985584 (10Dzahn) @ABran-WMF I wonder if you have thoughts on my original question on this ticket, back in August 2024 I said: "determine if this is res... [18:07:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-ctrl2002.codfw.wmnet with OS bookworm [18:07:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-ctrl2002.codfw.wmnet [18:09:35] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-codfw [18:11:54] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bookworm [18:14:02] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp-eqsin [18:22:32] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:26:40] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@52ec646]: T394526 [18:26:43] T394526: Data pipeline to aggregate CX monthly machine translation service usage - https://phabricator.wikimedia.org/T394526 [18:28:11] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@52ec646]: T394526 (duration: 01m 35s) [18:32:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:34:15] (03PS1) 10Krinkle: beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) [18:34:24] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-codfw [18:39:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp-eqsin [18:40:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10985734 (10aranyap) >>! In T398650#10974346, @Clement_Goubert wrote: > Please make sure the [[ https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikim... 
[18:42:48] FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:52:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:59:08] !log bking@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for wdqs2022.codfw.wmnet: Renew puppet certificate - bking@cumin1002 [19:09:21] (03CR) 10Ssingh: [C:03+1] "I forgot to add to the +1: 0 tests failed, 0 tests skipped, 18 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [19:12:02] (03Abandoned) 10JHathaway: WIP: do not merge [cookbooks] - 10https://gerrit.wikimedia.org/r/1165598 (owner: 10JHathaway) [19:13:31] (03CR) 10Dzahn: [V:03+1 C:03+1] gerrit: avoid hardcoded hostnames, replace with hiera lookups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [19:13:35] (03PS7) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [19:13:56] (03CR) 10Krinkle: "@sukhe: Thanks, I've debased this from the rest of the beta stack to ease landing." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [19:19:18] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:24:07] (03CR) 10Dzahn: [C:03+1] "I have tested this and it returns exit code 0 whether service is active or inactive. lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:24:43] (03CR) 10Dzahn: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [19:30:32] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10986005 (10Dzahn) a:05Arnoldokoth→03None [19:31:41] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10986023 (10Dzahn) a:03SKivlehan-WMF [19:32:28] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10986025 (10Dzahn) Still stalled. Assigned to user because we are waiting for their response. 
[19:35:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167285 [19:37:19] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167285 (owner: 10PipelineBot) [19:38:55] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167285 (owner: 10PipelineBot) [19:40:11] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [19:41:35] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [19:41:55] (03PS1) 10Xcollazo: analytics: deprioritize druid MapReduce jobs if needed [puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) [19:41:57] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [19:42:21] (03CR) 10CI reject: [V:04-1] analytics: deprioritize druid MapReduce jobs if needed [puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo) [19:42:46] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [19:42:55] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [19:43:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1234298072 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:43:44] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [19:47:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 483120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:48:34] (03CR) 10Xcollazo: "@btullis@wikimedia.org can you help me with the proper `Host` definition so that we can PPC this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:02:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986083 (10VRiley-WMF) [20:04:52] Hey all - going to use the backport window here (no changes scheduled) to get a private mitigation update deployed. [20:15:06] !log Deployed security mitigation update for T395468 [20:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:46] (03PS1) 10Cwhite: logstash: drop most mobileapps-staging outgoing request logs [puppet] - 10https://gerrit.wikimedia.org/r/1167287 (https://phabricator.wikimedia.org/T397252) [20:25:29] (03CR) 10BCornwall: [C:03+1] "Looks good, and bless you for having a runbook alongside this from the beginning." 
[alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [20:35:59] (03PS1) 10Dzahn: rename build pipelines for sourcebot [container/codesearch] - 10https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) [20:37:35] (03CR) 10Dzahn: [C:03+2] "still experimenting with image builds" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:50:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986258 (10VRiley-WMF) es1047 cableID: 1089 port 9 Rack A3 U 5 es1048 cableID: 5180 port 27 Rack B5 U7 [20:56:00] (03CR) 10Btullis: "I think that you should be able to use:" [puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo) [20:56:27] (03PS5) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T2100) [21:02:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:02:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:04:44] (03CR) 10Cwhite: [C:03+2] logstash: drop most mobileapps-staging outgoing request logs [puppet] - 10https://gerrit.wikimedia.org/r/1167287 (https://phabricator.wikimedia.org/T397252) (owner: 10Cwhite) [21:06:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10986306 (10Jclark-ctr) ml-serve1013 is cabled ml-serve1012 manually configured 
the root account and password [21:13:11] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bookworm [21:13:54] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:16:12] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:16:52] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:20:09] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1047 - vriley@cumin1002" [21:20:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1047 - vriley@cumin1002" [21:20:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:21:08] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:23:28] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:24:16] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:24:48] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1047 [21:26:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1047 [21:27:30] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host es1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:28:10] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1048 - vriley@cumin1002" [21:31:15] vriley@cumin1002 netbox (PID 3617261) is awaiting input [21:31:24] !log vriley@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1048 - vriley@cumin1002" [21:31:24] !log vriley@cumin1002 END (FAIL) - Cookbook 
sre.dns.netbox (exit_code=99) [21:31:28] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:33:37] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [21:33:49] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:34:17] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:37:34] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cloudcephosd1048,49 - jclark@cumin1002" [21:37:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cloudcephosd1048,49 - jclark@cumin1002" [21:37:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:38:24] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1051 [21:38:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcephosd1051 [21:38:28] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [21:38:31] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1049 [21:38:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1049 [21:38:46] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1048 [21:38:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1048 [21:40:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:40:34] 
(PS1) Zabe: Remove stdClass type hint from ApiFeedContributions::feedItem() for now [core] (wmf/1.45.0-wmf.9) - https://gerrit.wikimedia.org/r/1167296 (https://phabricator.wikimedia.org/T398925) [21:41:14] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:43:27] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:44:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:45:28] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host es1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:45:35] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10986464 (Jclark-ctr) @dcaro @Andrew @cmooney @ayounsi I need some assistance. I need to open a block of 4x ports on cloudsw1-f4-eqiad. The least dis...
[21:47:37] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1049 [21:47:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1049 [21:47:45] (PS1) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:47:49] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1048 [21:47:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1048 [21:51:15] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [21:51:55] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [21:55:14] (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper) [21:55:31] (CR) Zabe: [C:+2] Enable categorylinks read new on a few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167241 (https://phabricator.wikimedia.org/T397912) (owner: Zabe) [21:55:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:56:25] (Merged) jenkins-bot: Enable categorylinks read new on a few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167241 (https://phabricator.wikimedia.org/T397912) (owner: Zabe) [21:56:51] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bookworm [21:57:53] dancy: are you currently deploying?
[21:58:16] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10986480 (Jclark-ctr) [21:58:17] I am running an experiment but I can get out of your way [21:58:54] Stand by [21:59:36] zabe: All yours [21:59:51] Alright [21:59:52] Thanks [22:00:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:00:36] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167241|Enable categorylinks read new on a few large wikis (T397912)]] [22:00:39] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [22:01:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:02:45] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167241|Enable categorylinks read new on a few large wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:03:41] !log zabe@deploy1003 zabe: Continuing with sync [22:05:04] (CR) Zabe: [C:+2] Remove stdClass type hint from ApiFeedContributions::feedItem() for now [core] (wmf/1.45.0-wmf.9) - https://gerrit.wikimedia.org/r/1167296 (https://phabricator.wikimedia.org/T398925) (owner: Zabe) [22:08:55] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167241|Enable categorylinks read new on a few large wikis (T397912)]] (duration: 08m 19s) [22:08:58] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [22:09:00] (Merged) jenkins-bot: Remove stdClass type hint from ApiFeedContributions::feedItem() for now [core] (wmf/1.45.0-wmf.9) - https://gerrit.wikimedia.org/r/1167296 (https://phabricator.wikimedia.org/T398925) (owner: Zabe) [22:09:36] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] [22:09:39] T398925: TypeError: MediaWiki\Api\ApiFeedContributions::feedItem(): Argument #1 ($row) must be of type stdClass, Flow\Formatter\ContributionsRow given, called in /srv/mediawiki/php-1.45.0-wmf.9/includes/api/ApiFeedContributions.php on l - https://phabricator.wikimedia.org/T398925 [22:11:45] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986498 (VRiley-WMF) [22:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:12:11] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:12:49] (PS1) Zabe: Revert "Enable categorylinks read new on a few large wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167303 [22:12:53] (CR) Zabe: [C:+2] Revert "Enable categorylinks read new on a few large wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167303 (owner: Zabe) [22:13:01] !log zabe@deploy1003 zabe: Continuing with sync [22:13:43] (Merged) jenkins-bot: Revert "Enable categorylinks read new on a few large wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167303 (owner: Zabe) [22:13:56] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [22:14:04] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10986503 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [22:16:11] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host es1047.eqiad.wmnet with OS bookworm [22:16:17] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986507 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host es1047.eqiad.wmnet with OS bookworm [22:18:10] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] (duration: 08m 33s) [22:18:13] T398925: TypeError: MediaWiki\Api\ApiFeedContributions::feedItem(): Argument #1 ($row) must be of type stdClass, Flow\Formatter\ContributionsRow given, called in /srv/mediawiki/php-1.45.0-wmf.9/includes/api/ApiFeedContributions.php on l - https://phabricator.wikimedia.org/T398925 [22:18:37] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167303|Revert "Enable categorylinks read new on a few
large wikis"]] [22:20:45] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167303|Revert "Enable categorylinks read new on a few large wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:21:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:21:58] !log zabe@deploy1003 zabe: Continuing with sync [22:27:16] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167303|Revert "Enable categorylinks read new on a few large wikis"]] (duration: 08m 38s) [22:29:06] jclark@cumin1002 provision (PID 3666053) is awaiting input [22:29:35] dancy: I am done if you want to continue experimenting :) [22:30:56] thx. [22:36:11] SRE, collaboration-services, Release-Engineering-Team (Radar): Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846#10986675 (Dzahn) I am uploading a patch for that. I did notice though that the "SVN repo browser" part is st...
[22:37:24] (PS1) Dzahn: redirects: update SVN rewrite rules, do not link to Phabricator anymore [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) [22:38:18] (PS2) Dzahn: redirects: update SVN rewrite rules, do not link to Phabricator anymore [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) [22:38:50] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1047.eqiad.wmnet with reason: host reimage [22:43:17] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1047.eqiad.wmnet with reason: host reimage [22:53:23] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host es1048.eqiad.wmnet with OS bookworm [22:53:34] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986696 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host es1048.eqiad.wmnet with OS bookworm [23:06:08] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [23:09:13] vriley@cumin1002 reimage (PID 3685715) is awaiting input [23:09:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [23:09:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1047.eqiad.wmnet with OS bookworm [23:09:43] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986720 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host es1047.eqiad.wmnet with OS bookworm completed: - es1047 (**PASS**) - Remo...
[23:10:12] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986721 (VRiley-WMF) [23:11:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:15:48] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1048.eqiad.wmnet with reason: host reimage [23:16:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:24] SRE, Data-Engineering, LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10986735 (SKivlehan-WMF) Apologies, @Dzahn ! I have requested wmf LDAP access as provided by @elukey above. Thank you. [23:19:26] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1048.eqiad.wmnet with reason: host reimage [23:34:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [23:34:17] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10986762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed...
[23:38:02] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1167309 [23:38:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1167309 (owner: TrainBranchBot) [23:43:25] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [23:43:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [23:43:52] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1048.eqiad.wmnet with OS bookworm [23:43:58] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986766 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host es1048.eqiad.wmnet with OS bookworm completed: - es1048 (**PASS**) - Remo... [23:44:19] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986767 (VRiley-WMF) [23:44:37] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10986768 (VRiley-WMF) Open→Resolved The servers should be all set and ready to go [23:50:39] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1167309 (owner: TrainBranchBot) [23:54:13] (PS2) Xcollazo: analytics: deprioritize druid MapReduce jobs if needed [puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) [23:54:59] (CR) Xcollazo: "Thank you, added."
[puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: Xcollazo) [23:58:02] (CR) Xcollazo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: Xcollazo) [23:58:13] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [23:58:36] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply