[00:00:05] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:43] (SystemdUnitFailed) resolved: man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:02:00] (SystemdUnitFailed) firing: man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:35] PROBLEM - Check systemd state on puppetserver2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service,geoipupdate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:47] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service,geoipupdate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:19] PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service,geoipupdate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:27] PROBLEM - Check systemd state on puppetserver2002 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service,geoipupdate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:43] (SystemdUnitFailed) resolved: man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:27] (PS3) Andrew Bogott:
Horizon: update build version in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/982472 (https://phabricator.wikimedia.org/T326818)
[00:21:29] (PS1) Andrew Bogott: Update horizon docker version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982495
[00:22:10] (CR) Andrew Bogott: [C: +2] Update horizon docker version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982495 (owner: Andrew Bogott)
[00:38:45] (Device rebooted) firing: Alert for device ps1-b1-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:38:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/982193
[00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/982193 (owner: TrainBranchBot)
[00:47:05] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:45] (Device rebooted) resolved: Device ps1-b1-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:58:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/982193 (owner: TrainBranchBot)
[01:08:56] (PS1) RLazarus: admin_ng: Add the sidecar-job-controller ServiceAccount [deployment-charts] - https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284)
[01:14:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:23] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:17:10] (CR) RLazarus: "Whoops, I left this out of https://gerrit.wikimedia.org/r/981704 where I incorrectly had it still in the service chart -- but of course it" [deployment-charts] - https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[01:24:17] (PS2) RLazarus: admin_ng: Add the sidecar-job-controller ServiceAccount [deployment-charts] - https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284)
[01:58:43] (CR) RLazarus: admin_ng: Add the sidecar-job-controller ServiceAccount (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[01:59:28] (PS1) Ladsgroup: docroot: Add my pgp keys [mediawiki-config] - https://gerrit.wikimedia.org/r/982499
[01:59:44] (PS2) Ladsgroup: docroot: Add my pgp key [mediawiki-config] - https://gerrit.wikimedia.org/r/982499
[02:17:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:18:56] (CR) Tim Starling: [C: +1] "This is fine. I verified that the key is correct. However, I think it should be reverse chronological. It's not a credits file. I wasn't t" [mediawiki-config] - https://gerrit.wikimedia.org/r/982499 (owner: Ladsgroup)
[02:18:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:28:25] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:58] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:44] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:26:11] (PS1) Jdlrobson: Restore fixed width and height, direction of arrow on change list pages [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/982244 (https://phabricator.wikimedia.org/T352456)
[03:27:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:28:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:29:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:38:05] (CR) Hashar: [C: +2] Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: Hashar)
[03:38:41] (Merged) jenkins-bot: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: Hashar)
[03:38:51] I guess 5am deployment makes me eligible to join the DBA team
[03:41:00] !log hashar@deploy2002 Started deploy [gerrit/gerrit@9bf8914]: Add a banner for the 2023 developer survey - T351109
[03:41:05] T351109: Add MoTD to gerrit - https://phabricator.wikimedia.org/T351109
[03:41:08] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@9bf8914]: Add a banner for the 2023 developer survey - T351109 (duration: 00m 08s)
[03:46:28] (CR) CI reject: [V: -1] Restore fixed width and height, direction of arrow on change list pages [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/982244 (https://phabricator.wikimedia.org/T352456) (owner: Jdlrobson)
[03:57:26] (PS2) MilkyDefer: Enable action blocks for zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120)
[05:14:51] (CR) Krinkle: [C: +1] RunSingleJob.php: Remove overly complicated error handling (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: Bartosz Dziewoński)
[05:53:42] (PS1) Marostegui: dbproxy1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/982511 (https://phabricator.wikimedia.org/T351864)
[05:54:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1021.eqiad.wmnet with OS bookworm
[05:54:16] (CR) Marostegui: [C: +2] dbproxy1021: Disable notifications [puppet] -
https://gerrit.wikimedia.org/r/982511 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[05:59:37] (PS1) Marostegui: preseed.yaml: Do not reimage db1237 [puppet] - https://gerrit.wikimedia.org/r/982513
[06:02:33] (CR) Marostegui: [C: +2] preseed.yaml: Do not reimage db1237 [puppet] - https://gerrit.wikimedia.org/r/982513 (owner: Marostegui)
[06:06:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1021.eqiad.wmnet with reason: host reimage
[06:10:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1021.eqiad.wmnet with reason: host reimage
[06:18:07] (CR) Hashar: "CI fails due to:" [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391 (owner: Slyngshede)
[06:27:00] (PS1) Marostegui: Revert "dbproxy1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/982245
[06:27:41] (CR) Marostegui: [C: +2] Revert "dbproxy1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/982245 (owner: Marostegui)
[06:29:54] (PS1) Marostegui: dbproxy1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/982516 (https://phabricator.wikimedia.org/T351864)
[06:30:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1021.eqiad.wmnet with OS bookworm
[06:30:26] (CR) Marostegui: [C: +2] dbproxy1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/982516 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[06:38:10] (PS4) Andrew Bogott: Horizon: update build version in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/982472 (https://phabricator.wikimedia.org/T326818)
[06:38:12] (PS1) Andrew Bogott: Update horizon version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/982517
[06:41:49] (CR) Andrew Bogott: [C: +2] Update horizon version for codfw1dev [puppet] -
https://gerrit.wikimedia.org/r/982517 (owner: Andrew Bogott)
[06:51:55] SRE-swift-storage, MediaWiki-Uploading: "Internal error: Server failed to store temporary file" when trying to upload images with upload wizard - https://phabricator.wikimedia.org/T228929 (John_Cummings) @tstarling I just found this phab ticket after I received the same message, is there anything I can d...
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T0700)
[07:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:10:00] (CR) Jelto: [C: -1] "one comment in-line" [puppet] - https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney)
[07:22:08] (CR) Stang: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: MilkyDefer)
[07:28:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:36:20] (PS1) KartikMistry: Update [deployment-charts] - https://gerrit.wikimedia.org/r/982645
[07:40:53] !log marostegui@cumin1001 START - Cookbook
sre.hosts.reimage for host dbproxy1020.eqiad.wmnet with OS bookworm
[07:41:27] (PS2) KartikMistry: Update MinT to 2023-12-12-065316-production [deployment-charts] - https://gerrit.wikimedia.org/r/982645
[07:41:32] (PS2) Matthias Mullie: No custom UW licensing config [mediawiki-config] - https://gerrit.wikimedia.org/r/979113
[07:43:01] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1211.eqiad.wmnet onto db1226.eqiad.wmnet
[07:48:49] (PS1) Arnaudb: mariadb: bump db1226 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/982194 (https://phabricator.wikimedia.org/T344036)
[07:49:25] (CR) Marostegui: [C: +1] "Remember you'll need to remove 10.4 packages, then merge, then run puppet" [puppet] - https://gerrit.wikimedia.org/r/982194 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[07:50:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54345 and previous config saved to /var/cache/conftool/dbconfig/20231213-075006-arnaudb.json
[07:51:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 10%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54346 and previous config saved to /var/cache/conftool/dbconfig/20231213-075105-arnaudb.json
[07:51:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 10%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54347 and previous config saved to /var/cache/conftool/dbconfig/20231213-075123-arnaudb.json
[07:53:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1020.eqiad.wmnet with reason: host reimage
[07:54:05] (CR) Jdlrobson: "recheck" [core] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/982244 (https://phabricator.wikimedia.org/T352456) (owner: Jdlrobson)
[07:56:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 2:00:00 on dbproxy1020.eqiad.wmnet with reason: host reimage
[08:00:04] (PS1) Marostegui: Revert "dbproxy1020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/982647
[08:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T0800).
[08:00:05] matthiasmullie, jdlrobson, and milkydefer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:10] o/
[08:02:06] o/
[08:02:19] o/
[08:05:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54348 and previous config saved to /var/cache/conftool/dbconfig/20231213-080512-arnaudb.json
[08:06:07] !log installing openssh security updates
[08:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 20%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54349 and previous config saved to /var/cache/conftool/dbconfig/20231213-080610-arnaudb.json
[08:06:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 20%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54350 and previous config saved to /var/cache/conftool/dbconfig/20231213-080628-arnaudb.json
[08:11:44] (CR) Marostegui: [C: +2] Revert "dbproxy1020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/982647 (owner: Marostegui)
[08:12:28] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/981942 (owner: Slyngshede)
[08:12:42] I can get started & deploy my own patch if no-one is doing anything yet?
[08:13:56] starting now
[08:14:03] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/979113 (owner: Matthias Mullie)
[08:15:14] (Merged) jenkins-bot: No custom UW licensing config [mediawiki-config] - https://gerrit.wikimedia.org/r/979113 (owner: Matthias Mullie)
[08:15:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1020.eqiad.wmnet with OS bookworm
[08:16:11] !log mlitn@deploy2002 Started scap: Backport for [[gerrit:979113|No custom UW licensing config]]
[08:17:50] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:979113|No custom UW licensing config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:18:24] !log mlitn@deploy2002 mlitn: Continuing with sync
[08:20:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54351 and previous config saved to /var/cache/conftool/dbconfig/20231213-082017-arnaudb.json
[08:21:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 30%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54352 and previous config saved to /var/cache/conftool/dbconfig/20231213-082115-arnaudb.json
[08:21:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 30%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54353 and previous config saved to /var/cache/conftool/dbconfig/20231213-082133-arnaudb.json
[08:21:46] (PS1) Slyngshede: C:ganeti debug pcc, DO NOT MERGE [puppet] - https://gerrit.wikimedia.org/r/982758
[08:22:39] (PS1) Slyngshede: C:ganeti debug pcc, DO NOT MERGE [puppet] - https://gerrit.wikimedia.org/r/982758
[08:23:17] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/982758 (owner: Slyngshede)
[08:25:55] !log mlitn@deploy2002 Finished
scap: Backport for [[gerrit:979113|No custom UW licensing config]] (duration: 09m 43s)
[08:26:00] SRE, ops-codfw, Infrastructure-Foundations, netops: cr2-codfw:xe-1/0/1:1 down - https://phabricator.wikimedia.org/T353256 (ayounsi) Open→Resolved a:ayounsi > Dear Customer, > A patch that was incorrectly connected/labelled and the tech fixed it.
[08:27:09] @Jdlrobson can you self-service your patch, or do you want me to deploy it?
[08:28:45] Looks like he rescheduled his patch so you might skip that
[08:28:47] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2134971&oldid=2134952
[08:30:01] !log delete bgp group Confed_esams from cr2-drmrs - T347892
[08:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:25] @Amir1 @urbanecm (or any other regular deployer) are you around for @milkydefer's patch? I'm sure it's fine, but https://www.mediawiki.org/wiki/Manual:$wgEnablePartialActionBlocks has a "Warning: @unstable Temporary feature flag, to be removed before the release of 1.38: phab:T280532", so I'm hesitant to move it forward myself since it's something I know nothing about
[08:30:26] T280532: Remove partial action blocks feature flag - https://phabricator.wikimedia.org/T280532
[08:30:40] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:31:19] (PS2) Slyngshede: C:ganeti debug pcc, DO NOT MERGE [puppet] - https://gerrit.wikimedia.org/r/982758
[08:32:26] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/982758 (owner: Slyngshede)
[08:33:36] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[08:34:05] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff) Open→Resolved
a:MoritzMuehlenhoff This is complete
[08:35:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54354 and previous config saved to /var/cache/conftool/dbconfig/20231213-083522-arnaudb.json
[08:35:33] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[08:36:16] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[08:36:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 40%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54355 and previous config saved to /var/cache/conftool/dbconfig/20231213-083620-arnaudb.json
[08:36:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 40%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54356 and previous config saved to /var/cache/conftool/dbconfig/20231213-083638-arnaudb.json
[08:36:47] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete
[08:43:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46997
[08:44:01] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 46997
[08:44:38] @milkydefer I'd suggest you reschedule into another backport window if no one's around to take care of your patch by the end of this round :)
[08:45:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet
[08:45:21] Yeah I will wait till 10 mins before the end of the window
[08:46:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3856
[08:48:05] !log delete bgp
group Confed_drmrs from cr1-esams - T347892
[08:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3856
[08:48:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet
[08:49:34] (PS1) Brouberol: dse-k8s limitrange: ensure pod max memory is higher than container max memory [deployment-charts] - https://gerrit.wikimedia.org/r/982762 (https://phabricator.wikimedia.org/T351722)
[08:50:11] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (WMDE-leszek) hi folks. @darthmon_wmde is currently off. I'll remind her of a missing ssh key once she's back in January. Stalling until then so it does not show up in your boar...
[08:50:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54357 and previous config saved to /var/cache/conftool/dbconfig/20231213-085027-arnaudb.json
[08:51:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 50%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54358 and previous config saved to /var/cache/conftool/dbconfig/20231213-085125-arnaudb.json
[08:51:35] rescheduled
[08:51:37] (PS1) Klausman: hiera: Add amd_gpu role to workers and rocm_version to ml-staging2001 [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118)
[08:51:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 50%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54359 and previous config saved to /var/cache/conftool/dbconfig/20231213-085143-arnaudb.json
[08:53:19] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/882/console" [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118) (owner: Klausman)
[08:53:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 286, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:54:20] (PS2) Klausman: hiera: Add amd_gpu role to staging workers and rocm_version to ml-staging2001 [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118)
[08:55:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 202120
[08:55:11] (CR) Elukey: hiera: Add amd_gpu role to staging workers and rocm_version to ml-staging2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118) (owner: Klausman)
[08:55:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 202120
[08:58:24] (PS3) Klausman: Add AMD GPU configuration to ml-staging2001 [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118)
[08:58:37] (CR) Klausman: Add AMD GPU configuration to ml-staging2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118) (owner: Klausman)
[08:59:48] (CR) Elukey: [C: +1] Add AMD GPU configuration to ml-staging2001 [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118) (owner: Klausman)
[09:00:04] brennen and hashar: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot).
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T0900)
[09:03:19] (PS1) Filippo Giunchedi: alertmanager: show five weeks old alerts for triage [puppet] - https://gerrit.wikimedia.org/r/982763
[09:05:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 60%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54360 and previous config saved to /var/cache/conftool/dbconfig/20231213-090534-arnaudb.json
[09:06:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 60%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54361 and previous config saved to /var/cache/conftool/dbconfig/20231213-090631-arnaudb.json
[09:06:45] (CR) Klausman: [C: +2] Add AMD GPU configuration to ml-staging2001 [puppet] - https://gerrit.wikimedia.org/r/982761 (https://phabricator.wikimedia.org/T348118) (owner: Klausman)
[09:06:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 60%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54362 and previous config saved to /var/cache/conftool/dbconfig/20231213-090648-arnaudb.json
[09:09:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[09:14:36] (CR) Btullis: [C: +1] dse-k8s limitrange: ensure pod max memory is higher than container max memory [deployment-charts] - https://gerrit.wikimedia.org/r/982762 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[09:20:24] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[09:20:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 70%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54363 and previous config saved to /var/cache/conftool/dbconfig/20231213-092039-arnaudb.json
[09:21:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 70%: Post clone repooling', diff saved to
https://phabricator.wikimedia.org/P54364 and previous config saved to /var/cache/conftool/dbconfig/20231213-092136-arnaudb.json
[09:21:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 70%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54365 and previous config saved to /var/cache/conftool/dbconfig/20231213-092153-arnaudb.json
[09:22:33] (CR) Brouberol: [C: +2] dse-k8s limitrange: ensure pod max memory is higher than container max memory [deployment-charts] - https://gerrit.wikimedia.org/r/982762 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[09:24:16] (CR) LSobanski: [C: +1] alertmanager: show five weeks old alerts for triage [puppet] - https://gerrit.wikimedia.org/r/982763 (owner: Filippo Giunchedi)
[09:24:59] !log increasing pod max requested memory to a higher value than the container max requested memory for dse-k8s-eqiad - T351722
[09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:04] T351722: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722
[09:25:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:25:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:25:33] (CR) Filippo Giunchedi: [C: +2] alertmanager: show five weeks old alerts for triage [puppet] - https://gerrit.wikimedia.org/r/982763 (owner: Filippo Giunchedi)
[09:25:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[09:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[09:27:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[09:30:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[09:33:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet
[09:34:55] (PS1) Marostegui: pc2011: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/982765
[09:35:39] (CR) Marostegui: [C: +2] pc2011: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/982765 (owner: Marostegui)
[09:35:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 80%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54366 and previous config saved to /var/cache/conftool/dbconfig/20231213-093544-arnaudb.json
[09:36:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 80%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54367 and previous config saved to /var/cache/conftool/dbconfig/20231213-093641-arnaudb.json
[09:36:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 80%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54368 and previous config saved to /var/cache/conftool/dbconfig/20231213-093658-arnaudb.json
[09:40:00] (PS1) Klausman: modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye [puppet] - https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118)
[09:41:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [09:42:02] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [09:42:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [09:44:55] (03CR) 10Arnaudb: [C: 03+2] mariadb: bump db1226 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/982194 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:49:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [09:50:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 90%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54369 and previous config saved to /var/cache/conftool/dbconfig/20231213-095049-arnaudb.json [09:51:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 90%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54370 and previous config saved to /var/cache/conftool/dbconfig/20231213-095146-arnaudb.json [09:52:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 90%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54371 and previous config saved to /var/cache/conftool/dbconfig/20231213-095203-arnaudb.json [09:52:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [09:54:47] (03CR) 10Jelto: [C: 03+1] contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:55:56] (03CR) 10Jelto: [C: 03+2] contint: rename jenkins-slave to jenkins-agent [puppet] - 
10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:56:29] !log Disabled puppet agent on contint1002, contint2002, releases1003 and releases2003 to progressively deploy https://gerrit.wikimedia.org/r/922555 [09:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:36] (03CR) 10Elukey: [C: 04-1] modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [09:57:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db1226.eqiad.wmnet with OS bookworm [09:57:59] (03CR) 10Muehlenhoff: modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [09:59:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [10:00:23] !log failover ganeti master in eqsin to ganeti5007 [10:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:12] (03PS1) 10Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [10:02:37] (03CR) 10Elukey: [C: 04-1] modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:03:36] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:04:27] (03CR) 10Klausman: [V: 03+1] modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982766 
(https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:05:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1247 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54372 and previous config saved to /var/cache/conftool/dbconfig/20231213-100555-arnaudb.json [10:06:31] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] prometheus-php-fpm-exporter: Bullseye update and fix build script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982089 (owner: 10Clément Goubert) [10:06:40] (03PS2) 10Klausman: modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) [10:06:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1228 (re)pooling @ 100%: Post clone repooling', diff saved to https://phabricator.wikimedia.org/P54373 and previous config saved to /var/cache/conftool/dbconfig/20231213-100651-arnaudb.json [10:07:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1229 (re)pooling @ 100%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54374 and previous config saved to /var/cache/conftool/dbconfig/20231213-100708-arnaudb.json [10:09:03] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:09:46] (03CR) 10Btullis: spark-history: define helmfile configuration and release values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [10:10:40] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1226.eqiad.wmnet with reason: host reimage [10:11:01] (03CR) 10Clément Goubert: [C: 03+2] mw-debug: update php-fpm-exporter version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/982430 (owner: 10Clément Goubert) [10:11:03] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy nllb cpu version on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/982770 (https://phabricator.wikimedia.org/T351740) [10:11:10] !log hashar@deploy2002 Started deploy [releng/jenkins-deploy@77b3681] (releasing): Rename jenkins-slave to jenkins-agent - T254646 [10:11:14] T254646: Reconsidering how we name things - https://phabricator.wikimedia.org/T254646 [10:11:51] (03Merged) 10jenkins-bot: mw-debug: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982430 (owner: 10Clément Goubert) [10:11:53] !log hashar@deploy2002 Finished deploy [releng/jenkins-deploy@77b3681] (releasing): Rename jenkins-slave to jenkins-agent - T254646 (duration: 00m 42s) [10:12:45] (03CR) 10Elukey: "Looks good, but please remember that puppet deploys resources as they are declared in the manifest. In this case we are moving the package" [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:13:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1226.eqiad.wmnet with reason: host reimage [10:13:53] (03CR) 10Hashar: "I have migrated all the Jenkins production hosts successfully :)" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [10:18:04] (03PS2) 10Brouberol: spark-history: define helmfile configuration and release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) [10:18:33] (03CR) 10Elukey: [C: 03+1] ml-services: deploy nllb cpu version on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/982770 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [10:20:28] (03PS2) 10Clément Goubert: mw-on-k8s: update php-fpm-exporter 
version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982431 [10:20:30] (03PS2) 10Clément Goubert: shellbox: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982432 [10:20:32] (03PS1) 10Clément Goubert: mw-debug: Fix exporter settings yaml structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/982771 [10:20:41] (03CR) 10Brouberol: spark-history: define helmfile configuration and release values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982769 (https://phabricator.wikimedia.org/T352860) (owner: 10Brouberol) [10:22:15] (03CR) 10Clément Goubert: [C: 03+2] mw-debug: Fix exporter settings yaml structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/982771 (owner: 10Clément Goubert) [10:22:55] (03Merged) 10jenkins-bot: mw-debug: Fix exporter settings yaml structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/982771 (owner: 10Clément Goubert) [10:23:09] (03PS2) 10Ilias Sarantopoulos: ml-services: deploy nllb cpu version on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/982770 (https://phabricator.wikimedia.org/T351740) [10:24:14] !log Updating mw-debug prometheus-php-fpm-exporter to 0.0.3 [10:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:20] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:24:55] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:26:23] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:26:32] (03CR) 10Klausman: [V: 03+1] modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982766 
(https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:28:06] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy nllb cpu version on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/982770 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [10:28:27] (03CR) 10Btullis: [C: 03+1] druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120 (owner: 10Stevemunene) [10:28:59] (03Merged) 10jenkins-bot: ml-services: deploy nllb cpu version on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/982770 (https://phabricator.wikimedia.org/T351740) (owner: 10Ilias Sarantopoulos) [10:30:13] (03PS4) 10Brouberol: Configure the Spark History server host for the an-test yarn [puppet] - 10https://gerrit.wikimedia.org/r/981949 (https://phabricator.wikimedia.org/T352863) [10:30:15] (03PS4) 10Brouberol: Configure the Spark History server host for the analytics yarn [puppet] - 10https://gerrit.wikimedia.org/r/981950 (https://phabricator.wikimedia.org/T352863) [10:31:14] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [10:33:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1226.eqiad.wmnet with OS bookworm [10:40:12] 10SRE-swift-storage, 10MediaWiki-Uploading: "Internal error: Server failed to store temporary file" when trying to upload images with upload wizard - https://phabricator.wikimedia.org/T228929 (10tstarling) >>! In T228929#9401948, @John_Cummings wrote: > @tstarling I just found this phab ticket after I received... 
[10:41:23] (03PS1) 10Peter Fischer: Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982774 [10:44:13] (03CR) 10DCausse: [C: 03+1] Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982774 (owner: 10Peter Fischer) [10:45:29] (03CR) 10Elukey: [C: 03+1] modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:46:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036 [10:46:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036 [10:46:13] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:46:16] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036 [10:46:20] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: provisionning db1226.eqiad.wmnet - T344036 [10:46:29] (03CR) 10Klausman: [V: 03+1 C: 03+2] modules/amd_rocm: Pull firmware-amd-graphics from bpo for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/982766 (https://phabricator.wikimedia.org/T348118) (owner: 10Klausman) [10:48:01] 10SRE-tools, 10Infrastructure-Foundations, 10Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (10taavi) [10:48:44] (03CR) 10David Caro: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [10:48:52] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [10:49:17] (03Abandoned) 10Majavah: toolforge: Port portgrabber related code to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [10:49:24] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1226.eqiad.wmnet [10:50:09] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [10:54:32] (03PS1) 10Arnaudb: mariadb: toggle notifications [puppet] - 10https://gerrit.wikimedia.org/r/982195 (https://phabricator.wikimedia.org/T344036) [10:55:00] (03CR) 10Marostegui: [C: 03+1] "I assume they are all green on icinga" [puppet] - 10https://gerrit.wikimedia.org/r/982195 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:00:03] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1100) [11:00:24] (03CR) 10Hnowlan: RunSingleJob.php: Remove overly complicated error handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [11:01:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [11:05:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [11:13:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [11:15:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [11:16:19] (03PS1) 10Abijeet Patro: Utilities/Yaml: Use string as value with ini_set [extensions/Translate] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982653 
(https://phabricator.wikimedia.org/T348496) [11:24:14] (03CR) 10Btullis: [C: 03+1] Configure the Spark History server host for the an-test yarn [puppet] - 10https://gerrit.wikimedia.org/r/981949 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [11:25:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [11:25:31] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications [puppet] - 10https://gerrit.wikimedia.org/r/982195 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:25:34] (03CR) 10Btullis: [C: 03+1] "Looks good. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/981948 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [11:26:13] (03PS1) 10Muehlenhoff: Add Host host entries to configure stat1010/1011 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982779 (https://phabricator.wikimedia.org/T349619) [11:28:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:30:04] (03CR) 10Peter Fischer: [C: 03+2] Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982774 (owner: 10Peter Fischer) [11:30:49] (03Merged) 10jenkins-bot: Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982774 (owner: 10Peter Fischer) [11:32:38] (03CR) 10Muehlenhoff: [C: 03+2] Add Host host entries to configure stat1010/1011 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982779 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:33:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db1233.eqiad.wmnet with OS bookworm [11:36:29] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:37:07] (03PS1) 
10Muehlenhoff: Configure lists2001 to Puppet 7 via Hiera host entry [puppet] - 10https://gerrit.wikimedia.org/r/982780 (https://phabricator.wikimedia.org/T349619) [11:37:25] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:38:02] (03CR) 10Majavah: [C: 04-2] "Needs approval from Niharika first as I commented on the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [11:40:12] (03CR) 10Muehlenhoff: [C: 03+2] Configure lists2001 to Puppet 7 via Hiera host entry [puppet] - 10https://gerrit.wikimedia.org/r/982780 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:43:13] (03PS1) 10Arnaudb: mariadb: productionize db1232, db1233, db1248 [puppet] - 10https://gerrit.wikimedia.org/r/982196 (https://phabricator.wikimedia.org/T344036) [11:46:23] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage [11:49:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage [12:02:42] !log setting cp4037 as inactive - T352876 [12:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:49] T352876: cp4037 reimage for cookbook getting stuck at PXE boot - https://phabricator.wikimedia.org/T352876 [12:03:28] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm:deployment restart Bitu on configuration changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/981942 (owner: 10Slyngshede) [12:03:40] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1233.eqiad.wmnet with OS bookworm [12:04:10] (03CR) 10Marostegui: [C: 03+1] mariadb: productionize db1232, db1233, db1248 [puppet] - 10https://gerrit.wikimedia.org/r/982196 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [12:08:55] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:08:58] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:09:12] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Yann) I got these messages `Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/b/ba/Update_40220_Overview_-_Age_of_Emp... [12:09:57] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:10:02] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:11:23] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:11:27] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:16:03] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:16:17] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:16:22] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Yann) Now I got while undeleting the same batch https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Files_found_with_insource:youtube.com/user/offi... 
[12:19:38] (03PS1) 10Muehlenhoff: Configure debmonitor2003 to Puppet 7 via Hiera host entry [puppet] - 10https://gerrit.wikimedia.org/r/982784 (https://phabricator.wikimedia.org/T349619) [12:20:47] (03PS1) 10Effie Mouzeli: (WIP 2) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/982785 (https://phabricator.wikimedia.org/T346690) [12:21:08] 10SRE, 10SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10Sandeeps) @jhathaway Thank you for the update, I will submit a Gerrit patch with my SSH key for verification shortly. [12:21:59] (03CR) 10Muehlenhoff: [C: 03+2] Configure debmonitor2003 to Puppet 7 via Hiera host entry [puppet] - 10https://gerrit.wikimedia.org/r/982784 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:22:06] (03CR) 10JMeybohm: [C: 03+1] admin_ng: Add the sidecar-job-controller ServiceAccount [deployment-charts] - 10https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [12:25:00] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:25:10] (03PS2) 10Effie Mouzeli: (WIP 2) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/982785 (https://phabricator.wikimedia.org/T346690) [12:25:35] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:32:21] (03PS1) 10Majavah: P:toolforge::checker: remove kubernetes node readiness check [puppet] - 10https://gerrit.wikimedia.org/r/982786 (https://phabricator.wikimedia.org/T313030) [12:40:14] !log installing OpenSSH security updates on bullseye [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:22] (03CR) 10Brouberol: [V: 03+1 C: 03+2] [yarn] Add the option to configure the spark history server address [puppet] - 10https://gerrit.wikimedia.org/r/981948 (https://phabricator.wikimedia.org/T352863) (owner: 
10Brouberol) [12:42:41] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:42:54] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:44:02] (03CR) 10Brouberol: [C: 03+2] Configure the Spark History server host for the an-test yarn [puppet] - 10https://gerrit.wikimedia.org/r/981949 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [12:50:09] (03PS1) 10Majavah: team-wmcs: metricsinfra: page when alertmanager is unreachable [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) [12:55:44] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Yann) Again `Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/c/c6/Logo_AoE_III_DE_-_Mexico_Civilization_02.png" is i... [12:55:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1211.eqiad.wmnet onto db1226.eqiad.wmnet [13:02:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice" [puppet] - 10https://gerrit.wikimedia.org/r/982786 (https://phabricator.wikimedia.org/T313030) (owner: 10Majavah) [13:02:23] (03CR) 10Majavah: [C: 03+2] P:toolforge::checker: remove kubernetes node readiness check [puppet] - 10https://gerrit.wikimedia.org/r/982786 (https://phabricator.wikimedia.org/T313030) (owner: 10Majavah) [13:04:13] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [13:04:37] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete [13:04:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice" [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) 
[13:05:22] !log delete raw replica blocks for prometheus/ops (only one replica) in eqiad - T351927 [13:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:27] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [13:07:44] (03CR) 10Muehlenhoff: "One comment inline, looks good otherwise." [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [13:08:47] (03PS1) 10Clément Goubert: docker-report: Fix stretch images regex [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) [13:09:16] (03PS2) 10Hnowlan: trafficserver: route all requests for /api/rest_v1/metrics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/966885 (https://phabricator.wikimedia.org/T336385) [13:10:57] (03PS1) 10Slyngshede: Debian packaging configuration [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/982794 [13:12:20] (03PS1) 10Ilias Sarantopoulos: ml-services: update llm and readability images [deployment-charts] - 10https://gerrit.wikimedia.org/r/982795 (https://phabricator.wikimedia.org/T352834) [13:12:35] (03CR) 10Arnaudb: [C: 03+2] mariadb: productionize db1232, db1233, db1248 [puppet] - 10https://gerrit.wikimedia.org/r/982196 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [13:12:54] (03Abandoned) 10Slyngshede: Debian packaging configuration [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/982794 (owner: 10Slyngshede) [13:15:57] (03PS2) 10Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 [13:16:25] (03CR) 10Slyngshede: Debian packaging configuration (031 comment) [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [13:18:42] (03CR) 10Hnowlan: [C: 03+1] docker-report: Fix stretch images regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982793 
(https://phabricator.wikimedia.org/T348876) (owner: 10Clément Goubert) [13:19:10] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:21:27] (03CR) 10David Caro: team-wmcs: metricsinfra: page when alertmanager is unreachable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [13:23:36] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: provisionning db1232.eqiad.wmnet - T344036 [13:23:42] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [13:23:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: provisionning db1232.eqiad.wmnet - T344036 [13:23:55] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: provisionning db1232.eqiad.wmnet - T344036 [13:24:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: provisionning db1232.eqiad.wmnet - T344036 [13:25:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1132 in db1232 for T344036', diff saved to https://phabricator.wikimedia.org/P54376 and previous config saved to /var/cache/conftool/dbconfig/20231213-132511-arnaudb.json [13:27:13] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1132.eqiad.wmnet onto db1232.eqiad.wmnet [13:27:21] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update llm and readability images [deployment-charts] - 10https://gerrit.wikimedia.org/r/982795 
(https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:27:39] (03CR) 10Klausman: [C: 03+1] ml-services: update llm and readability images [deployment-charts] - 10https://gerrit.wikimedia.org/r/982795 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:27:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [13:28:21] (03PS2) 10Majavah: team-wmcs: metricsinfra: page when alertmanager is unreachable [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) [13:29:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:29:29] (03CR) 10Majavah: [C: 03+2] team-wmcs: metricsinfra: page when alertmanager is unreachable (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [13:30:50] (03Merged) 10jenkins-bot: team-wmcs: metricsinfra: page when alertmanager is unreachable [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [13:34:02] (03PS1) 10Brouberol: Revert "Configure the Spark History server host for the an-test yarn" [puppet] - 10https://gerrit.wikimedia.org/r/982656 [13:34:19] (03PS1) 10Brouberol: Revert "[yarn] Add the option to configure the spark history server address" [puppet] - 10https://gerrit.wikimedia.org/r/982657 [13:36:32] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update llm and readability images [deployment-charts] - 10https://gerrit.wikimedia.org/r/982795 
(https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:37:39] (03Merged) 10jenkins-bot: ml-services: update llm and readability images [deployment-charts] - 10https://gerrit.wikimedia.org/r/982795 (https://phabricator.wikimedia.org/T352834) (owner: 10Ilias Sarantopoulos) [13:39:58] (03CR) 10David Caro: team-wmcs: metricsinfra: page when alertmanager is unreachable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/982788 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [13:40:35] (03CR) 10Btullis: [C: 03+1] Revert "Configure the Spark History server host for the an-test yarn" [puppet] - 10https://gerrit.wikimedia.org/r/982656 (owner: 10Brouberol) [13:41:00] (03CR) 10Btullis: [C: 03+1] Revert "[yarn] Add the option to configure the spark history server address" [puppet] - 10https://gerrit.wikimedia.org/r/982657 (owner: 10Brouberol) [13:44:36] (03PS2) 10Brouberol: Revert "Configure the Spark History server host for the an-test yarn" [puppet] - 10https://gerrit.wikimedia.org/r/982656 (https://phabricator.wikimedia.org/T352863) [13:44:44] (03PS2) 10Brouberol: Revert "[yarn] Add the option to configure the spark history server address" [puppet] - 10https://gerrit.wikimedia.org/r/982657 (https://phabricator.wikimedia.org/T352863) [13:44:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: provisionning db1233.eqiad.wmnet - T344036 [13:44:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: provisionning db1233.eqiad.wmnet - T344036 [13:44:53] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [13:44:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: provisionning db1233.eqiad.wmnet - T344036 [13:45:06] (03CR) 10CI reject: [V: 04-1] Revert "Configure the Spark History server host for the 
an-test yarn" [puppet] - 10https://gerrit.wikimedia.org/r/982656 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [13:45:23] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: provisionning db1233.eqiad.wmnet - T344036 [13:46:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1129 in db1233 for T344036', diff saved to https://phabricator.wikimedia.org/P54379 and previous config saved to /var/cache/conftool/dbconfig/20231213-134632-arnaudb.json [13:46:39] (03PS3) 10Brouberol: Revert "Configure the Spark History server host for the an-test yarn" [puppet] - 10https://gerrit.wikimedia.org/r/982656 (https://phabricator.wikimedia.org/T352863) [13:46:46] (03PS3) 10Brouberol: Revert "[yarn] Add the option to configure the spark history server address" [puppet] - 10https://gerrit.wikimedia.org/r/982657 (https://phabricator.wikimedia.org/T352863) [13:47:31] (03PS1) 10Brouberol: spark3: add option to specify spark history server address to yarn [puppet] - 10https://gerrit.wikimedia.org/r/982797 (https://phabricator.wikimedia.org/T352863) [13:47:33] (03PS1) 10Brouberol: spark3: Specify the history server endoint for the test-analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/982798 (https://phabricator.wikimedia.org/T352863) [13:48:50] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1129.eqiad.wmnet onto db1233.eqiad.wmnet [13:48:56] (03PS2) 10Brouberol: spark3: Specify the history server endoint for the test-analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/982798 (https://phabricator.wikimedia.org/T352863) [13:49:16] (03CR) 10Brouberol: [C: 03+2] Revert "Configure the Spark History server host for the an-test yarn" [puppet] - 10https://gerrit.wikimedia.org/r/982656 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [13:49:29] (03PS4) 10Brouberol: Revert "[yarn] Add the option to configure the spark history server 
address" [puppet] - 10https://gerrit.wikimedia.org/r/982657 (https://phabricator.wikimedia.org/T352863) [13:49:32] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1129.eqiad.wmnet onto db1233.eqiad.wmnet [13:50:09] !log installing postgresql-11 security updates [13:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:24] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1129.eqiad.wmnet onto db1233.eqiad.wmnet [13:51:39] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1129.eqiad.wmnet onto db1233.eqiad.wmnet [13:52:42] (03CR) 10Brouberol: [C: 03+2] Revert "[yarn] Add the option to configure the spark history server address" [puppet] - 10https://gerrit.wikimedia.org/r/982657 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [13:53:30] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [13:53:34] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1129.eqiad.wmnet onto db1233.eqiad.wmnet [13:54:18] (03PS3) 10Effie Mouzeli: (WIP 2) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/982785 (https://phabricator.wikimedia.org/T346690) [13:55:43] (03CR) 10Btullis: [C: 03+1] spark3: add option to specify spark history server address to yarn [puppet] - 10https://gerrit.wikimedia.org/r/982797 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [13:57:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: provisionning db1248.eqiad.wmnet - T344036 [13:57:54] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [13:58:10] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: provisionning db1248.eqiad.wmnet - T344036 [13:58:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: provisionning db1248.eqiad.wmnet - T344036 [13:58:45] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982797 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [13:58:51] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982798 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1400) [14:00:05] Dreamy_Jazz and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1148 in db1248 for T344036', diff saved to https://phabricator.wikimedia.org/P54380 and previous config saved to /var/cache/conftool/dbconfig/20231213-140017-arnaudb.json [14:00:40] o/ [14:01:01] (03CR) 10Vgutierrez: [C: 03+2] traffic: Alert on configured and observed MSS mismatch [alerts] - 10https://gerrit.wikimedia.org/r/980280 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:01:42] I can deploy if Dreamy_Jazz and/or abijeet are around [14:01:53] (03PS1) 10Slyngshede: Changes to Python infrastucture to help building Debian package. 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 [14:02:14] (03CR) 10Brouberol: [C: 03+2] spark3: add option to specify spark history server address to yarn [puppet] - 10https://gerrit.wikimedia.org/r/982797 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:02:15] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1148.eqiad.wmnet onto db1248.eqiad.wmnet [14:04:26] (03CR) 10Brouberol: [C: 03+2] spark3: Specify the history server endoint for the test-analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/982798 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [14:06:08] Lucas_WMDE, hello, thanks! [14:06:49] ok! [14:07:26] (03PS1) 10Muehlenhoff: Create initial stub role for logging-hd and configure for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) [14:07:38] o_O german error message in logspam-watch (“Der Pfad ist nicht vorhanden”) [14:07:56] (03CR) 10CI reject: [V: 04-1] Create initial stub role for logging-hd and configure for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) (owner: 10Muehlenhoff) [14:09:51] (03CR) 10Volans: [C: 04-1] "Much better, left some comments inline." [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [14:10:00] 10ops-esams: Port with no description on access switch - https://phabricator.wikimedia.org/T344633 (10ayounsi) 05Open→03Resolved a:03ayounsi [14:10:27] I’m confused by the old code in that Translate change [14:10:44] “scalar okay with php8.1” sure, but we also want to support older PHP versions? where this was not okay? 
[14:11:06] but ok ^^ [14:11:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Translate] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982653 (https://phabricator.wikimedia.org/T348496) (owner: 10Abijeet Patro) [14:12:26] ETA 17 minutes? [14:12:28] ok ._. [14:12:43] Haha, yea CI for Translate is slow [14:12:54] I mean, same for Wikibase ^^ [14:13:02] I was hoping it might be one of those lucky extensions where it’s fast :D [14:18:12] (03Abandoned) 10Elukey: profile::pyrra::filesystem: new Lift Wing pilot candidate [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [14:18:21] (03Abandoned) 10Elukey: profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [14:18:40] (03Abandoned) 10Elukey: profile::cache::kafka::webrequest: allow to customize the format [puppet] - 10https://gerrit.wikimedia.org/r/980911 (https://phabricator.wikimedia.org/T346463) (owner: 10Elukey) [14:19:21] (03CR) 10Elukey: [C: 03+1] webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [14:19:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [14:21:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54381 and previous config saved to /var/cache/conftool/dbconfig/20231213-142111-arnaudb.json [14:22:08] (03PS1) 10Arnaudb: mariadb: toggle notifications for db1211 [puppet] - 10https://gerrit.wikimedia.org/r/982197 (https://phabricator.wikimedia.org/T344036) [14:22:49] (03PS1) 10Elukey: ml-service: deploy new Docker image for article-descriptions 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/982803 (https://phabricator.wikimedia.org/T343123) [14:22:52] (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications for db1211 [puppet] - 10https://gerrit.wikimedia.org/r/982197 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:23:12] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications for db1211 [puppet] - 10https://gerrit.wikimedia.org/r/982197 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:24:54] (03CR) 10Kevin Bazira: [C: 03+1] ml-service: deploy new Docker image for article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/982803 (https://phabricator.wikimedia.org/T343123) (owner: 10Elukey) [14:25:08] (03PS2) 10Elukey: ml-services: deploy new Docker image for article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/982803 (https://phabricator.wikimedia.org/T343123) [14:25:44] Lucas_WMDE: Are you still around for the window? Didn't see the ping. [14:26:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [14:26:50] Dreamy_Jazz: still around and waiting for gate-and-submit-wmf [14:28:12] (03CR) 10Elukey: [C: 03+2] ml-services: deploy new Docker image for article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/982803 (https://phabricator.wikimedia.org/T343123) (owner: 10Elukey) [14:28:53] 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10MoritzMuehlenhoff) [14:29:39] Okay. If you have time for my config change, I will be unable to test it as it requires having CU on a group1 wiki. test2wiki exists, but I think it should be okay to make the change and I monitor logstash for issues. 
[14:30:06] (03Merged) 10jenkins-bot: Utilities/Yaml: Use string as value with ini_set [extensions/Translate] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982653 (https://phabricator.wikimedia.org/T348496) (owner: 10Abijeet Patro) [14:30:32] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:982653|Utilities/Yaml: Use string as value with ini_set (T348496)]] [14:30:36] T348496: Modernize code under util/ directory starting with T* - https://phabricator.wikimedia.org/T348496 [14:30:48] (03CR) 10Jforrester: [C: 03+1] docroot: Add my pgp key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982499 (owner: 10Ladsgroup) [14:31:41] Dreamy_Jazz: ack, yeah I think that should be okay [14:31:59] (03CR) 10Jforrester: "Should we move these to historical for authenticating ancient releases? I suppose that's a usage we don't really want to encourage…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981583 (owner: 10Brion VIBBER) [14:32:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and abi: Backport for [[gerrit:982653|Utilities/Yaml: Use string as value with ini_set (T348496)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:32:38] abijeet: can you test the change? [14:32:42] (03PS5) 10Andrew Bogott: Horizon: update build version in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/982472 (https://phabricator.wikimedia.org/T326818) [14:32:44] (03PS1) 10Andrew Bogott: Horizon local_settings: minor comment cleanup [puppet] - 10https://gerrit.wikimedia.org/r/982804 [14:33:01] (trigger the code and check that no warning appears in mwdebug logstash, I guess?) 
[14:33:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] service_proxy/mesh: Bump to newer version globally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [14:33:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:33:30] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update build version in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/982472 (https://phabricator.wikimedia.org/T326818) (owner: 10Andrew Bogott) [14:33:43] (03CR) 10Andrew Bogott: [C: 03+2] Horizon local_settings: minor comment cleanup [puppet] - 10https://gerrit.wikimedia.org/r/982804 (owner: 10Andrew Bogott) [14:34:38] Lucas_WMDE, I can do a sanity check. [14:34:47] ok thanks [14:36:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54383 and previous config saved to /var/cache/conftool/dbconfig/20231213-143616-arnaudb.json [14:37:53] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T349830 (10ayounsi) 05Open→03Resolved a:03ayounsi Fixed [14:38:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:40] (03PS1) 10Bking: extra-plugins: Fix jar hell issue 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982805 (https://phabricator.wikimedia.org/T353270) [14:40:19] (03CR) 10Gehel: [C: 03+2] extra-plugins: Fix jar hell issue [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/982805 (https://phabricator.wikimedia.org/T353270) (owner: 10Bking) [14:40:45] (03PS1) 10Arnaudb: mariadb: decommission db1128 db1129 db1147 [puppet] - 10https://gerrit.wikimedia.org/r/982198 (https://phabricator.wikimedia.org/T350458) [14:41:37] I am restarting Gerrit [14:41:40] abijeet: are you still checking? [14:42:15] !log Restarted Gerrit on gerrit1003 and gerrit2002 [14:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:22] Lucas_WMDE, it looks good. [14:42:27] alright, thanks! [14:42:28] Lucas_WMDE, thanks for waiting [14:42:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and abi: Continuing with sync [14:43:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:43:34] (03CR) 10Marostegui: [C: 03+1] mariadb: decommission db1128 db1129 db1147 [puppet] - 10https://gerrit.wikimedia.org/r/982198 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [14:43:47] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:43:51] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:45:15] (03PS2) 10Muehlenhoff: Create initial stub role for logging-hd and configure for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) [14:45:22] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, and 2 
others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) >>! In T351710#9390698, @fgiunchedi wrote: >>>! In T351710#9385748, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-oper... [14:47:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) (owner: 10Muehlenhoff) [14:47:37] (03PS2) 10Lucas Werkmeister (WMDE): CheckUser: Enable read new for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982105 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:47:52] (03CR) 10JMeybohm: docker-report: Fix stretch images regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) (owner: 10Clément Goubert) [14:49:01] (03CR) 10Clément Goubert: docker-report: Fix stretch images regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) (owner: 10Clément Goubert) [14:49:12] hehe, nostalgiawiki at the top of the diffConfig ^^ https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/5414/console [14:49:29] (03CR) 10JMeybohm: [C: 03+1] docker-report: Fix stretch images regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) (owner: 10Clément Goubert) [14:49:41] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:982653|Utilities/Yaml: Use string as value with ini_set (T348496)]] (duration: 19m 09s) [14:49:46] T348496: Modernize code under util/ directory starting with T* - https://phabricator.wikimedia.org/T348496 [14:49:59] (03PS1) 10Vgutierrez: traffic: Provide a dashboard link for LVSRealServerMSS [alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) [14:50:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982105 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:50:49] let’s hope this finishes in time, I have a meeting in 10 minutes :S [14:50:59] :D [14:51:01] Same here [14:51:19] (03CR) 10CI reject: [V: 04-1] traffic: Provide a dashboard link for LVSRealServerMSS [alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:51:21] (03Merged) 10jenkins-bot: CheckUser: Enable read new for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982105 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:51:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 30%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54384 and previous config saved to /var/cache/conftool/dbconfig/20231213-145121-arnaudb.json [14:51:36] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [14:51:45] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:982105|CheckUser: Enable read new for event tables migration on group1 (T341829)]] [14:51:49] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:52:26] (03PS3) 10Muehlenhoff: Create initial stub role for logging-hd and configure for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) [14:53:15] (03PS2) 10Clément Goubert: docker-report: Fix stretch images regex [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) [14:53:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and dreamyjazz: Backport for 
[[gerrit:982105|CheckUser: Enable read new for event tables migration on group1 (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:53:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and dreamyjazz: Continuing with sync [14:53:46] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:11] (03CR) 10Hnowlan: [C: 03+1] mesh: Use ca-certificates instead of wmf-ca-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981341 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [14:54:18] (03PS2) 10Vgutierrez: traffic: Provide a dashboard link for LVSRealServerMSS [alerts] - 10https://gerrit.wikimedia.org/r/982808 (https://phabricator.wikimedia.org/T351069) [14:56:05] (03PS1) 10DCausse: cirrus-streaming-updater: bump consumer-searhc resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/982809 [14:57:03] (03PS2) 10DCausse: cirrus-streaming-updater: bump consumer-search resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/982809 [14:57:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982801 (https://phabricator.wikimedia.org/T352517) (owner: 10Muehlenhoff) [14:59:46] it’ll probably finish just in time :D [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1500) [15:00:15] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:982105|CheckUser: Enable read new for event tables migration on group1 (T341829)]] (duration: 08m 29s) [15:00:21] !log UTC afternoon backport+config window done [15:00:27] bang on time :D [15:00:30] T341829: Enable read new for the event table migration - 
https://phabricator.wikimedia.org/T341829 [15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:37] * Lucas_WMDE done [15:01:54] Thanks! [15:02:07] (03CR) 10Arnaudb: [C: 03+2] mariadb: decommission db1128 db1129 db1147 [puppet] - 10https://gerrit.wikimedia.org/r/982198 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [15:02:13] jouncebot: nowandnext [15:02:13] For the next 0 hour(s) and 57 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1500) [15:02:13] In 2 hour(s) and 57 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1800) [15:02:30] (03CR) 10Peter Fischer: [C: 03+1] "Alright, let's try this (once the metric fix is ready?)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982809 (owner: 10DCausse) [15:04:09] (03CR) 10Ladsgroup: [C: 03+2] docroot: Add my pgp key (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982499 (owner: 10Ladsgroup) [15:04:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'decommission db1128 29 and 47', diff saved to https://phabricator.wikimedia.org/P54385 and previous config saved to /var/cache/conftool/dbconfig/20231213-150425-arnaudb.json [15:05:00] (03Merged) 10jenkins-bot: docroot: Add my pgp key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982499 (owner: 10Ladsgroup) [15:05:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1128.eqiad.wmnet [15:06:07] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:982499|docroot: Add my pgp key]] [15:06:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54386 and previous config saved to /var/cache/conftool/dbconfig/20231213-150626-arnaudb.json [15:07:34] !log ladsgroup@deploy2002 ladsgroup: Backport for 
[[gerrit:982499|docroot: Add my pgp key]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:09:04] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:09:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:10:22] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:12:06] (03CR) 10Hnowlan: [C: 03+1] docker-report: Fix stretch images regex [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) (owner: 10Clément Goubert) [15:12:22] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1128.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:13:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1128.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:13:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1128.eqiad.wmnet [15:15:28] 10ops-eqiad, 10decommission-hardware: decommission db1128.eqiad.wmnet - https://phabricator.wikimedia.org/T353326 (10ABran-WMF) [15:15:57] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:982499|docroot: Add my pgp key]] (duration: 09m 50s) [15:16:36] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1129.eqiad.wmnet [15:17:03] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1129.eqiad.wmnet onto db1233.eqiad.wmnet [15:21:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: Post clone (source of db1226) repooling', diff 
saved to https://phabricator.wikimedia.org/P54387 and previous config saved to /var/cache/conftool/dbconfig/20231213-152131-arnaudb.json [15:21:37] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:22:48] (03PS1) 10Hnowlan: rest-gateway: correct device-analytics host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/982816 [15:24:05] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1129.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:24:41] (03CR) 10Clément Goubert: [C: 03+2] docker-report: Fix stretch images regex [puppet] - 10https://gerrit.wikimedia.org/r/982793 (https://phabricator.wikimedia.org/T348876) (owner: 10Clément Goubert) [15:25:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1129.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:25:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1129.eqiad.wmnet [15:25:13] 10ops-eqiad, 10decommission-hardware: decommission db1128.eqiad.wmnet - https://phabricator.wikimedia.org/T353326 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by arnaudb@cumin1001 for hosts: `db1129.eqiad.wmnet` - db1129.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - F... 
[15:26:11] 10ops-eqiad, 10decommission-hardware: decommission db1129.eqiad.wmnet - https://phabricator.wikimedia.org/T353327 (10ABran-WMF) [15:28:13] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1147.eqiad.wmnet [15:28:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:31:52] (03PS3) 10DCausse: cirrus-streaming-updater: bump consumer-search resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/982809 [15:32:05] (03CR) 10Clément Goubert: [C: 03+2] shellbox: update php-fpm-exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982432 (owner: 10Clément Goubert) [15:32:09] (03CR) 10Peter Fischer: [C: 03+2] cirrus-streaming-updater: bump consumer-search resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/982809 (owner: 10DCausse) [15:33:02] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:33:32] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump consumer-search resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/982809 (owner: 10DCausse) [15:34:15] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: correct device-analytics host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/982816 (owner: 10Hnowlan) [15:34:19] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:34:53] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:35:02] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1147.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:35:05] (03PS3) 10Clément Goubert: shellbox: update php-fpm-exporter version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/982432 [15:35:18] !log tagging 1.41.0-rc.0 in core [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:47] (03Merged) 10jenkins-bot: rest-gateway: correct device-analytics host header [deployment-charts] - 10https://gerrit.wikimedia.org/r/982816 (owner: 10Hnowlan) [15:36:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1147.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:36:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1147.eqiad.wmnet [15:36:27] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10PantheraLeo1359531) Just for clarification so that I understand: Afaik, the main reason for deletion the Age of Empires-related contents was the doubt regar... 
[15:36:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54389 and previous config saved to /var/cache/conftool/dbconfig/20231213-153636-arnaudb.json [15:37:04] 10ops-eqiad, 10decommission-hardware: decommission db1147.eqiad.wmnet - https://phabricator.wikimedia.org/T353330 (10ABran-WMF) [15:39:35] !log Deploying shellbox: update php-fpm-exporter version - 982432 [15:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:49] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [15:40:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:40:24] (03PS1) 10JMeybohm: kubernetes::master Add blackbox checks for kuber-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) [15:42:56] (03PS2) 10Alexandros Kosiaris: mesh: Ship new configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981340 (https://phabricator.wikimedia.org/T352906) [15:42:58] (03PS2) 10Alexandros Kosiaris: mesh: Use ca-certificates instead of wmf-ca-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981341 (https://phabricator.wikimedia.org/T352906) [15:43:00] (03PS1) 10Alexandros Kosiaris: apertium/blubberoid: Bump mesh.configuration to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982820 (https://phabricator.wikimedia.org/T352906) [15:43:03] (03PS1) 10Alexandros Kosiaris: mobileapps: mesh.configuration:1.5.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982821 (https://phabricator.wikimedia.org/T352906) [15:43:05] (03PS1) 10Alexandros Kosiaris: function-orchestrator: Bump mesh.configuration:1.4.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982822 (https://phabricator.wikimedia.org/T352906) [15:43:07] 
(03PS1) 10Alexandros Kosiaris: Bump mesh.configuration:1.4.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982823 (https://phabricator.wikimedia.org/T352906) [15:43:27] (03CR) 10CI reject: [V: 04-1] kubernetes::master Add blackbox checks for kuber-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [15:43:29] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/886/console" [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [15:43:39] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [15:43:54] (03PS1) 10Ladsgroup: Fix my email in the key list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982824 [15:44:01] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:44:04] (03CR) 10CI reject: [V: 04-1] Fix my email in the key list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982824 (owner: 10Ladsgroup) [15:45:02] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:45:10] (03PS2) 10JMeybohm: kubernetes::master Add blackbox checks for kuber-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) [15:46:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:48:25] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [15:49:07] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [15:49:29] (03CR) 10JMeybohm: kubernetes::master Add blackbox checks for kuber-apiserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982819 
(https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [15:49:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] mesh: Ship new configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981340 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:50:02] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [15:50:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981341 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:50:32] (03Merged) 10jenkins-bot: mesh: Ship new configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981340 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:50:45] (03PS3) 10JMeybohm: kubernetes::master Add blackbox checks for kuber-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) [15:50:56] (03Merged) 10jenkins-bot: mesh: Use ca-certificates instead of wmf-ca-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981341 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:51:10] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [15:51:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium/blubberoid: Bump mesh.configuration to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982820 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:51:26] (03PS2) 10Alexandros Kosiaris: function-orchestrator: Bump mesh.configuration:1.6.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982822 (https://phabricator.wikimedia.org/T352906) [15:51:28] (03PS2) 10Alexandros Kosiaris: Bump mesh.configuration:1.4.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982823 
(https://phabricator.wikimedia.org/T352906) [15:51:43] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [15:51:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 70%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54392 and previous config saved to /var/cache/conftool/dbconfig/20231213-155142-arnaudb.json [15:51:58] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:52:08] (03Merged) 10jenkins-bot: apertium/blubberoid: Bump mesh.configuration to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982820 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:52:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [15:52:41] (03CR) 10Filippo Giunchedi: [C: 03+1] kubernetes::master Add blackbox checks for kuber-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [15:52:59] (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:53:54] (03PS1) 10Btullis: Add a spark system user/group for the spark-history service [puppet] - 10https://gerrit.wikimedia.org/r/982846 (https://phabricator.wikimedia.org/T352838) [15:55:42] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1128.eqiad.wmnet - https://phabricator.wikimedia.org/T353326 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [15:55:56] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master Add blackbox checks for kuber-apiserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982819 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [15:56:01] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission 
db1128.eqiad.wmnet - https://phabricator.wikimedia.org/T353326 (10VRiley-WMF) [15:56:15] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [15:56:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [15:57:03] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [15:58:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [15:58:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1132.eqiad.wmnet onto db1232.eqiad.wmnet [15:58:46] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:58:47] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [15:59:13] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [15:59:14] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply [15:59:15] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:59:38] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [15:59:39] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [15:59:58] !log upgrade apertium, bluebberoid everywhere to use the latest service_proxy image, 1.23.10-2-s4-20231203 T352906 [16:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] T352906: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 [16:00:11] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [16:00:19] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:00:34] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:01:01] !log cgoubert@deploy2002 helmfile 
[codfw] START helmfile.d/services/shellbox-media: apply [16:01:19] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:01:35] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:01:54] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:02:59] (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:03:11] (03PS1) 10Alexandros Kosiaris: Revert "citoid: Set service_mesh version to 1.23.10-2-s4-20231203" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982831 [16:03:18] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:03:21] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:03:58] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/blubberoid: apply [16:03:58] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF [16:04:09] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [16:04:11] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [16:04:14] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:04:29] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [16:04:30] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [16:04:34] (03CR) 10Hnowlan: [C: 03+1] Revert "citoid: Set service_mesh version to 1.23.10-2-s4-20231203" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982831 (owner: 10Alexandros 
Kosiaris) [16:04:49] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [16:05:11] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:05:21] (03PS2) 10Ladsgroup: Fix my email in the key list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982824 [16:05:34] (03CR) 10Ladsgroup: [C: 03+2] Fix my email in the key list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982824 (owner: 10Ladsgroup) [16:05:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "citoid: Set service_mesh version to 1.23.10-2-s4-20231203" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982831 (owner: 10Alexandros Kosiaris) [16:06:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: mesh.configuration:1.5.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982821 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [16:06:22] (03PS1) 10Ilias Sarantopoulos: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/982849 [16:06:30] (03Merged) 10jenkins-bot: Revert "citoid: Set service_mesh version to 1.23.10-2-s4-20231203" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982831 (owner: 10Alexandros Kosiaris) [16:06:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 80%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54393 and previous config saved to /var/cache/conftool/dbconfig/20231213-160647-arnaudb.json [16:06:48] (03Merged) 10jenkins-bot: Fix my email in the key list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982824 (owner: 10Ladsgroup) [16:07:10] (03Merged) 10jenkins-bot: mobileapps: mesh.configuration:1.5.x to latest patch level [deployment-charts] - 10https://gerrit.wikimedia.org/r/982821 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [16:07:43] !log ladsgroup@deploy2002 Started scap: Backport for 
[[gerrit:982824|Fix my email in the key list]] [16:08:15] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:08:19] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb1009, rdb1010 - https://phabricator.wikimedia.org/T352547 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [16:08:38] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:09:21] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:982824|Fix my email in the key list]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:09:27] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:09:44] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [16:09:57] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [16:11:20] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:12:02] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:12:27] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb1009, rdb1010 - https://phabricator.wikimedia.org/T352547 (10VRiley-WMF) 05Open→03Resolved [16:12:52] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:13:26] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [16:13:33] (03CR) 10Elukey: [C: 03+1] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/982849 (owner: 10Ilias Sarantopoulos) [16:14:30] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/982849 (owner: 10Ilias Sarantopoulos) [16:14:49] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new extra plugins - 
bking@cumin2002 - T353270 [16:15:04] T353270: Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [16:15:07] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [16:15:27] (03Merged) 10jenkins-bot: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/982849 (owner: 10Ilias Sarantopoulos) [16:15:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [16:16:28] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:982824|Fix my email in the key list]] (duration: 08m 45s) [16:18:31] (03PS1) 10JMeybohm: kubernetes::master Add blackbox checks for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982852 (https://phabricator.wikimedia.org/T353233) [16:18:41] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [16:19:25] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:19:39] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:20:30] (03PS2) 10JMeybohm: kubernetes::master Add blackbox checks for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982852 (https://phabricator.wikimedia.org/T353233) [16:20:53] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sessionstore1004 [16:20:56] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 90%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54394 and previous config saved to /var/cache/conftool/dbconfig/20231213-162152-arnaudb.json [16:22:28] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
sessionstore1004 [16:23:12] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host sessionstore1004.mgmt.eqiad.wmnet with reboot policy FORCED [16:24:37] (03PS1) 10JMeybohm: pki::multirootca: Merge custom profiles on top of default_profiles [puppet] - 10https://gerrit.wikimedia.org/r/982854 (https://phabricator.wikimedia.org/T353314) [16:25:06] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master Add blackbox checks for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/982852 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [16:25:28] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sessionstore1005 [16:26:37] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:26:40] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sessionstore1005 [16:27:10] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:27:31] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sessionstore1006 [16:27:57] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/887/con" [puppet] - 10https://gerrit.wikimedia.org/r/982854 (https://phabricator.wikimedia.org/T353314) (owner: 10JMeybohm) [16:28:15] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host sessionstore1005.mgmt.eqiad.wmnet with reboot policy FORCED [16:29:19] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sessionstore1006 [16:30:07] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host sessionstore1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:31:50] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore1004.mgmt.eqiad.wmnet with reboot policy FORCED 
[16:34:54] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:34:57] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:35:38] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:35:49] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:36:05] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:36:18] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:36:56] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore1005.mgmt.eqiad.wmnet with reboot policy FORCED [16:36:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Post clone (source of db1226) repooling', diff saved to https://phabricator.wikimedia.org/P54395 and previous config saved to /var/cache/conftool/dbconfig/20231213-163657-arnaudb.json [16:36:59] (ProbeDown) firing: (4) Service kubemaster1001:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:42] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:38:43] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:39:17] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new extra plugins - bking@cumin2002 - T353270 [16:39:19] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:39:24] T353270: Update elasticsearch 
instance extra plugin - https://phabricator.wikimedia.org/T353270 [16:43:40] (03PS1) 10DDesouza: Partially undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982857 (https://phabricator.wikimedia.org/T344393) [16:47:04] (03PS1) 10JMeybohm: kubernetes::master Group blackbox checks per cluster [puppet] - 10https://gerrit.wikimedia.org/r/982858 (https://phabricator.wikimedia.org/T353233) [16:47:34] (ProbeDown) firing: Service releases1003:443 has failed probes (http_releases_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:46] (ProbeDown) firing: (9) Service kubemaster1001:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:58] me [16:49:20] ack [16:49:58] if only it would tell why :) [16:51:59] (ProbeDown) firing: (18) Service aux-k8s-ctrl1002:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10VRiley-WMF) [16:52:34] (ProbeDown) resolved: Service releases1003:443 has failed probes (http_releases_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:37] jayme if you use 'task' to create a ticket for those kinda failures, it has a nice link to logstash 
https://phabricator.wikimedia.org/T352083 [16:53:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1148.eqiad.wmnet onto db1248.eqiad.wmnet [16:54:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10VRiley-WMF) [16:54:38] ...or not , I don't see http_kube_apiserver_ip4 as a 'service.name' on that dashboard [16:55:03] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore1004'] [16:55:39] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore1005'] [16:55:41] inflatador: yeah - the logstash dashboard can be filtered ofc [16:55:55] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore1006'] [16:55:58] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:17] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore1004'] [16:56:24] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore1005'] [16:56:29] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore1006'] [16:56:43] jayme indeed, looks like a cert error? 
https://logstash.wikimedia.org/goto/e447bdde1cde2aebf27eb65f71bbe877 [16:57:03] yeah, it's a config mixup I suppose [16:57:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10VRiley-WMF) [16:58:47] It would be nice to have an AM template that puts the logstash link in the alert msg [16:58:54] (03CR) 10Jdlrobson: [C: 03+1] Filter errors originating in external tools [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: 10Jdlrobson) [16:59:55] (03CR) 10Jdlrobson: [C: 03+1] Filter errors originating in external tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: 10Jdlrobson) [17:00:08] inflatador: to be fair, there is a link on the grafana dashboard [17:00:43] (03CR) 10CDanis: [C: 03+1] pki::multirootca: Merge custom profiles on top of default_profiles [puppet] - 10https://gerrit.wikimedia.org/r/982854 (https://phabricator.wikimedia.org/T353314) (owner: 10JMeybohm) [17:01:06] jayme I agree...I'd like logstash though, as I am terrible at finding things in logstash ;) [17:02:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 128, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:03:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:04:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:04:33] (03PS2) 10JMeybohm: kubernetes::master Group blackbox checks per cluster [puppet] - 10https://gerrit.wikimedia.org/r/982858 (https://phabricator.wikimedia.org/T353233) [17:06:08] inflatador: ^ that should fix it if you're curious [17:07:53] 
(03CR) 10JMeybohm: [C: 03+2] kubernetes::master Group blackbox checks per cluster [puppet] - 10https://gerrit.wikimedia.org/r/982858 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [17:07:56] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51008 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:10:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.409 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:27] (03CR) 10RLazarus: [C: 03+2] admin_ng: Add the sidecar-job-controller ServiceAccount [deployment-charts] - 10https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [17:13:29] (03PS1) 10JMeybohm: kubernetes::master Fix syntax error concatenating strings [puppet] - 10https://gerrit.wikimedia.org/r/982863 (https://phabricator.wikimedia.org/T353233) [17:14:13] (03Merged) 10jenkins-bot: admin_ng: Add the sidecar-job-controller ServiceAccount [deployment-charts] - 10https://gerrit.wikimedia.org/r/982497 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [17:16:28] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/888/console" [puppet] - 10https://gerrit.wikimedia.org/r/982863 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [17:16:59] (ProbeDown) firing: (24) Service aux-k8s-ctrl1002:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:17:28] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master Fix 
syntax error concatenating strings [puppet] - 10https://gerrit.wikimedia.org/r/982863 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [17:17:56] (03CR) 10Jforrester: [C: 03+1] RunSingleJob.php: Remove overly complicated error handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982414 (https://phabricator.wikimedia.org/T353262) (owner: 10Bartosz Dziewoński) [17:18:47] (ProbeDown) firing: (26) Service aux-k8s-ctrl1002:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:14] still working on that one...give it another 5' [17:21:59] (ProbeDown) firing: (32) Service aux-k8s-ctrl1002:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:24:58] jayme: okay to deploy admin_ng with this going on, or should I hold off? [17:25:11] rzl: go ahead [17:25:15] 👍 [17:25:28] it's me breaking prometheus config - k8s is fine [17:25:40] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:27:13] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[17:32:04] (ProbeDown) firing: (32) Service aux-k8s-ctrl1002:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:07] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:33:46] (ProbeDown) resolved: (32) Service aux-k8s-ctrl1002:6443 has failed probes (http_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:37:43] (03PS1) 10JMeybohm: kubernetes::master Fix logic for certificate_expiry_days [puppet] - 10https://gerrit.wikimedia.org/r/982889 (https://phabricator.wikimedia.org/T353233) [17:40:12] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T351279 (10Papaul) a:03VRiley-WMF [17:44:04] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply new extra plugins - bking@cumin2002 - T353270 [17:44:11] (03PS2) 10JMeybohm: kubernetes::master Fix logic for certificate_expiry_days [puppet] - 10https://gerrit.wikimedia.org/r/982889 (https://phabricator.wikimedia.org/T353233) [17:44:16] T353270: Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [17:46:19] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/889/c" [puppet] - 10https://gerrit.wikimedia.org/r/982889 (https://phabricator.wikimedia.org/T353233) (owner: 10JMeybohm) [17:48:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1147.eqiad.wmnet - 
https://phabricator.wikimedia.org/T353330 (10VRiley-WMF) [17:48:48] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1147.eqiad.wmnet - https://phabricator.wikimedia.org/T353330 (10VRiley-WMF) 05Open→03Resolved a:05ABran-WMF→03VRiley-WMF [17:52:24] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1129.eqiad.wmnet - https://phabricator.wikimedia.org/T353327 (10VRiley-WMF) 05Open→03Resolved a:05ABran-WMF→03VRiley-WMF [17:53:23] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T351279 (10VRiley-WMF) Rebalanced power in cabinet. [17:53:33] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T351279 (10VRiley-WMF) 05Open→03Resolved [17:57:51] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:58:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (10VRiley-WMF) [17:58:30] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:58:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (10VRiley-WMF) [18:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1800) [18:05:02] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:05:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:06:25] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:06:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:07:19] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:07:54] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:07:57] 10Puppet, 10Wikimedia Meet: Puppetize the jitsi instance - https://phabricator.wikimedia.org/T251040 (10Ladsgroup) 05Open→03Declined Wikimedia Meet has been retired [18:22:22] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:06] brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1900). [19:00:06] brennen and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T1900). 
[19:01:20] o/ [19:01:38] !log 1.42.0-wmf.9 (T350085) status: no blockers, rolling to group1 [19:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:44] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:03:57] I had enough of that bot [19:03:57] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply new extra plugins - bking@cumin2002 - T353270 [19:04:01] T353270: Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [19:04:09] I am patching jouncebot to skip the lame humor whenever I am scheduled [19:12:24] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.9 refs T350085 [19:12:29] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:13:42] (03PS1) 10Kimberly Sarabia: Add new stream names to the config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982903 (https://phabricator.wikimedia.org/T353297) [19:14:16] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (10cmooney) p:05Triage→03Low [19:19:42] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (10cmooney) [19:19:54] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.9 refs T350085 (duration: 07m 29s) [19:19:59] T350085: 1.42.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T350085 [19:21:24] (03PS7) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) [19:23:41] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" 
alert - https://phabricator.wikimedia.org/T353364 (10cmooney) Also - we should change the regexp to also catch "et-" prefixes for 25G interfaces ` REGEXP "^(g|x)e-[0-9]+/[0-9]+/[0-9]+" ` [19:28:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:31:20] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2031.codfw.wmnet [19:31:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2031.codfw.wmnet [19:33:51] (03PS2) 10Eevans: restbase: set production role and add config for restbase2032 [puppet] - 10https://gerrit.wikimedia.org/r/981606 (https://phabricator.wikimedia.org/T352468) [19:33:53] (03PS2) 10Eevans: restbase: set production role and add config for restbase2033 [puppet] - 10https://gerrit.wikimedia.org/r/981607 (https://phabricator.wikimedia.org/T352468) [19:33:55] (03PS2) 10Eevans: restbase: set production role and add config for restbase2034 [puppet] - 10https://gerrit.wikimedia.org/r/981608 (https://phabricator.wikimedia.org/T352468) [19:33:57] (03PS2) 10Eevans: restbase: set production role and add config for restbase2035 [puppet] - 10https://gerrit.wikimedia.org/r/981609 (https://phabricator.wikimedia.org/T352468) [19:35:40] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2032 [puppet] - 10https://gerrit.wikimedia.org/r/981606 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [19:35:58] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: prepare new public & internal hosts [puppet] - 10https://gerrit.wikimedia.org/r/982463 (https://phabricator.wikimedia.org/T982172) (owner: 10Ryan Kemper) [19:51:06] 10SRE, 10SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10jhathaway) [19:55:22] 10SRE, 10SRE-Access-Requests: 
Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10jhathaway) [19:59:00] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10ANakanishi_WMF) [20:04:22] (03PS1) 10JHathaway: admin: shell & releng access for sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/982912 (https://phabricator.wikimedia.org/T353186) [20:06:33] (03CR) 10Cwhite: [C: 03+2] Filter errors originating in external tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) (owner: 10Jdlrobson) [20:08:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:09:38] PROBLEM - cassandra-a CQL 10.192.32.229:9042 on restbase2032 is CRITICAL: connect to address 10.192.32.229 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:09:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (10VRiley-WMF) 05Open→03Resolved a:05Fabfur→03VRiley-WMF [20:16:39] 10Puppet, 10Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (10colewhite) Patch is merged. I see a corresponding [[ https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queri... 
[20:18:12] (03PS1) 10Subramanya Sastry: Temporarily disable isPreview in Parsoid's rendering [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 [20:21:31] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (10jhathaway) @ANakanishi_WMF happy to help, they will first need a developer account, https://idm.wikimedia.org/signup, before I can pro... [20:21:33] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10jhathaway) @ANakanishi_WMF happy to help, they will first need a developer account, https://idm.wikimedia.org/signup, before I can procee... [20:24:44] (03CR) 10CI reject: [V: 04-1] Temporarily disable isPreview in Parsoid's rendering [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [20:26:48] (03CR) 10Subramanya Sastry: "recheck" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [20:27:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:28:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts planet1002.eqiad.wmnet [20:30:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:12] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:33:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:33:35] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts planet1002.eqiad.wmnet [20:34:14] (03PS1) 10Dwisehaupt: Install community_civicrm on crm role [puppet] - 10https://gerrit.wikimedia.org/r/982914 (https://phabricator.wikimedia.org/T343486) [20:36:19] (03CR) 10Subramanya Sastry: "recheck" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [20:37:17] (03CR) 10Subramanya Sastry: "Why is that test failing (doesn't seem related to this patch at all)? Trying once more. Is something off on wmf.9?" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [20:38:09] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (10cmooney) [20:38:31] (03PS10) 10Cwhite: Enable $wgStatsTarget for requests to mwdebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) [20:41:07] (03CR) 10Bartosz Dziewoński: Temporarily disable isPreview in Parsoid's rendering (031 comment) [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [20:45:40] 10SRE-tools, 10Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (10Dzahn) [20:46:22] (03PS1) 10Subramanya Sastry: tests: Use MediaWikiIntegrationTestCase::setGroupPermissions [extensions/CheckUser] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982835 (https://phabricator.wikimedia.org/T353210) [20:46:33] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.ganeti.GanetiError: 
Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (10Dzahn) [20:47:36] (03CR) 10Subramanya Sastry: "This is needed on wmf.9 so that cherrypicks onto wmf.9 don't fail CI." [extensions/CheckUser] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982835 (https://phabricator.wikimedia.org/T353210) (owner: 10Subramanya Sastry) [20:48:30] (03PS2) 10Subramanya Sastry: Temporarily disable isPreview in Parsoid's rendering [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 [20:59:59] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T353295 (10BCornwall) @MoritzMuehlenhoff The other instance also have 10G. Would you still recommend that despite it bringing inconsistency? [21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T2100). [21:00:06] jdlrobson, danisztls, subbu, and cwhite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:15] o/ [21:00:46] o/ [21:01:44] present [21:02:09] o/ [21:10:47] Any deployers around? [21:16:19] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (10ayounsi) More or less a duplicate of {T306007} [21:20:24] I guess more of us should sign up to become deployers and train. 
[21:20:41] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (10cmooney) [21:20:51] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netbox, 10netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (10cmooney) [21:22:12] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (10cmooney) Ah yeah I'd forgotten about that one. What do you think about changing the alert text? I'm sure after investigating today I'll remember the details... [21:24:32] * subbu signed up for deployment training [21:27:47] subbu: while we wait for you to be trained, I've poked in -releng for today [21:27:59] Cc cwhite Jdlrobson [21:28:14] Yes, training will not happen till tomorrow at the earliest. :) thanks! [21:28:32] thanks @RhinosF1 [21:29:36] hello, do we need any backports? [21:29:54] jeena: yes a few [21:30:11] let me get logged into the deploy server and then I can start [21:31:49] I'll go in order of what's on the deployment calendar list if that's okay [21:32:04] starting with Jdlrobson [21:32:07] (CertAlmostExpired) firing: (2) Certificate for service kubestagemaster2001:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#kubestagemaster2001:6443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:33:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982244 (https://phabricator.wikimedia.org/T352456) (owner: 10Jdlrobson) [21:33:49] sounds good! [21:36:47] thx jeena [21:36:51] np [21:44:43] jeena reg https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/982835 .. I think it is enough to just +2 it and not actually deploy it. It is needed on wmf.9 so that other cherry-picked patches pass CI .. 
But, since it is tests only, I don't think it actually needs to be on the servers. But, I'll leave you to make that call. [21:45:38] (In advance, we're not using the slot after this, so feel free to use it for backports if needed.) [21:46:19] subbu: that will be fine I think [21:46:23] thanks James_F ! [21:47:29] subbu: I see no one has reviewed it though [21:47:50] it is a cherry-pick and was a release blocker for wmf.9 [21:47:56] ah ok [21:48:17] https://phabricator.wikimedia.org/T353210 is the relevant task. [21:49:02] 👍 [21:50:02] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff You need to run Ganeti-related cookbooks from cumin2002 until cumin1001... [21:50:26] (03Merged) 10jenkins-bot: Restore fixed width and height, direction of arrow on change list pages [core] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982244 (https://phabricator.wikimedia.org/T352456) (owner: 10Jdlrobson) [21:50:52] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:982244|Restore fixed width and height, direction of arrow on change list pages (T352456 T353099)]] [21:50:58] T352456: watchlist collapsible-item markers are now smaller than yesterday and point up instead of right when collapsed - https://phabricator.wikimedia.org/T352456 [21:50:58] T353099: Watchlist grouping icons became backwards in RTL - https://phabricator.wikimedia.org/T353099 [21:52:18] !log jhuneidi@deploy2002 jhuneidi and jdlrobson: Backport for [[gerrit:982244|Restore fixed width and height, direction of arrow on change list pages (T352456 T353099)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:52:43] Jdlrobson: lmk when to continue with sync [21:53:28] jeena: looking! 
[21:53:46] danisztls: I will go ahead and +2 your patch [21:54:01] (03CR) 10Jeena Huneidi: [C: 03+2] Partially undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982857 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:54:04] jeena: LGTM ! please sync [21:54:10] !log jhuneidi@deploy2002 jhuneidi and jdlrobson: Continuing with sync [21:55:05] (03Merged) 10jenkins-bot: Partially undeploy Reader Demographics 2 survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982857 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:57:48] jeena: looks like mine is still out another 10 mins at least .. afk for about 5-10 mins as I relocate from the coffee shop to home. [21:58:00] okay [21:59:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new extra plugins - bking@cumin2002 - T353270 [21:59:51] T353270: Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231213T2200) [22:01:06] cwhite, are you here for the backport window? [22:01:21] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:982244|Restore fixed width and height, direction of arrow on change list pages (T352456 T353099)]] (duration: 10m 28s) [22:01:31] yep! [22:01:31] T352456: watchlist collapsible-item markers are now smaller than yesterday and point up instead of right when collapsed - https://phabricator.wikimedia.org/T352456 [22:01:32] T353099: Watchlist grouping icons became backwards in RTL - https://phabricator.wikimedia.org/T353099 [22:02:17] would it be okay to backport your patch and this patch at the same time? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/982857 or you can wait until after [22:02:47] danisztls: are you ready for backport? [22:02:56] jeena: it is ok but I can wait if you prefer [22:03:06] simultaneously is ok by me [22:03:09] I've already merged your patch so it's next in line [22:03:28] okay cwhite i'll add yours as well [22:04:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [22:05:23] (03Merged) 10jenkins-bot: Enable $wgStatsTarget for requests to mwdebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [22:05:30] here [22:05:49] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:982857|Partially undeploy Reader Demographics 2 survey (T344393)]], [[gerrit:955015|Enable $wgStatsTarget for requests to mwdebug (T240685)]] [22:05:56] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [22:05:57] T240685: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 [22:05:57] subbu, I'll go ahead and +2 your first patch [22:06:06] yes [22:06:20] (03CR) 10Jeena Huneidi: [C: 03+2] tests: Use MediaWikiIntegrationTestCase::setGroupPermissions [extensions/CheckUser] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982835 (https://phabricator.wikimedia.org/T353210) (owner: 10Subramanya Sastry) [22:07:13] !log jhuneidi@deploy2002 dani and jhuneidi and cwhite: Backport for [[gerrit:982857|Partially undeploy Reader Demographics 2 survey (T344393)]], [[gerrit:955015|Enable $wgStatsTarget for requests to mwdebug (T240685)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:07:25] danisztls: cwhite ready for you to check on debug [22:08:13] jeena: looks good [22:09:58] !log eevans@cumin1001 START - Cookbook 
sre.hosts.reimage for host sessionstore1004.eqiad.wmnet with OS bullseye [22:10:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore1004.eqiad.wmnet with OS bullseye [22:11:05] jeena: looks good from here too [22:11:10] cool thanks [22:11:13] !log jhuneidi@deploy2002 dani and jhuneidi and cwhite: Continuing with sync [22:11:50] !log brett@cumin2002 START - Cookbook sre.ganeti.makevm for new host acmechief1002.eqiad.wmnet [22:11:52] !log brett@cumin2002 START - Cookbook sre.dns.netbox [22:14:07] thanks jeena would you also be able to +2 a beta cluster only change? [22:14:39] yeah, sure [22:14:57] let me finish up with subbu's stuff and then I can do that? [22:15:25] or technically I guess it can go in at the same time since it's a no-op? [22:15:26] (03PS1) 10Dwisehaupt: Fix up a typo in community_civicrm::config_nonce [labs/private] - 10https://gerrit.wikimedia.org/r/982924 (https://phabricator.wikimedia.org/T343486) [22:15:51] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM acmechief1002.eqiad.wmnet - brett@cumin2002" [22:16:44] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM acmechief1002.eqiad.wmnet - brett@cumin2002" [22:16:44] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:16:44] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache acmechief1002.eqiad.wmnet on all recursors [22:16:47] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) acmechief1002.eqiad.wmnet on all recursors [22:17:14] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera 
data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM acmechief1002.eqiad.wmnet - brett@cumin2002" [22:17:18] hey jeena, found a typo that somehow didn't get caught in the original patch. any chance could get a single-character change deployed? [22:17:41] s/caught/corrected [22:18:07] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM acmechief1002.eqiad.wmnet - brett@cumin2002" [22:18:22] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:982857|Partially undeploy Reader Demographics 2 survey (T344393)]], [[gerrit:955015|Enable $wgStatsTarget for requests to mwdebug (T240685)]] (duration: 12m 33s) [22:18:27] cwhite: yes I can do that if you have the time to wait [22:18:27] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [22:18:27] T240685: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 [22:18:43] cool, pushing it up now [22:19:03] subbu: just waiting on tests now I think [22:19:13] ack [22:19:41] (03PS1) 10Jdlrobson: [BC] Enable desktop diff and history pages on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982925 (https://phabricator.wikimedia.org/T350181) [22:19:57] jeena: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/982925 is the patch. No rush [22:20:28] (03PS1) 10Cwhite: Update wgStatsTarget to port 9125 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982867 (https://phabricator.wikimedia.org/T240685) [22:20:54] jeena: ^^ [22:20:57] Sorry about that! [22:20:59] Jdlrobson: cwhite can you add your new patches to the deployment calendar? 
[22:21:05] wilco [22:21:10] thanks [22:22:34] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1004.eqiad.wmnet with reason: host reimage [22:23:02] (03PS1) 10BCornwall: Create hieradata for host acmechief1002 [puppet] - 10https://gerrit.wikimedia.org/r/982926 (https://phabricator.wikimedia.org/T352242) [22:23:48] done! [22:24:13] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye [22:24:33] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1006.eqiad.wmnet with OS bullseye [22:25:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore1005.eqiad.wmnet with OS bullseye [22:25:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore1006.eqiad.wmnet with OS bullseye [22:26:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1004.eqiad.wmnet with reason: host reimage [22:27:04] that ci is taking its own sweet time! [22:27:12] yeah! [22:27:27] but I guess I can still +2 your second change since it has a depends-on? 
[22:27:45] (03CR) 10BBlack: [C: 03+1] Create hieradata for host acmechief1002 [puppet] - 10https://gerrit.wikimedia.org/r/982926 (https://phabricator.wikimedia.org/T352242) (owner: 10BCornwall) [22:27:58] yes [22:28:01] (03Merged) 10jenkins-bot: tests: Use MediaWikiIntegrationTestCase::setGroupPermissions [extensions/CheckUser] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982835 (https://phabricator.wikimedia.org/T353210) (owner: 10Subramanya Sastry) [22:28:12] (03CR) 10Jeena Huneidi: [C: 03+2] Temporarily disable isPreview in Parsoid's rendering [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [22:29:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [22:29:51] (03PS1) 10Ryan Kemper: search: new hosts need puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/982927 [22:32:39] jeena: done [22:33:18] (03CR) 10BCornwall: [C: 03+2] Create hieradata for host acmechief1002 [puppet] - 10https://gerrit.wikimedia.org/r/982926 (https://phabricator.wikimedia.org/T352242) (owner: 10BCornwall) [22:33:50] thanks! 
just waiting on a backport [22:34:43] (03Merged) 10jenkins-bot: Temporarily disable isPreview in Parsoid's rendering [extensions/DiscussionTools] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/982834 (owner: 10Subramanya Sastry) [22:35:12] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:982835|tests: Use MediaWikiIntegrationTestCase::setGroupPermissions (T353210)]], [[gerrit:982834|Temporarily disable isPreview in Parsoid's rendering]] [22:35:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host acmechief1002.eqiad.wmnet with OS bookworm [22:35:18] T353210: CheckUser tests failing for ApiQueryCheckUserLog and SpecialCheckUserLog: "You don't have permission to check users' IP addresses and other information" - https://phabricator.wikimedia.org/T353210 [22:35:20] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T353295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host acmechief1002.eqiad.wmnet with OS bookworm [22:36:34] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1005.eqiad.wmnet with reason: host reimage [22:37:03] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1006.eqiad.wmnet with reason: host reimage [22:37:17] !log jhuneidi@deploy2002 ssastry and jhuneidi: Backport for [[gerrit:982835|tests: Use MediaWikiIntegrationTestCase::setGroupPermissions (T353210)]], [[gerrit:982834|Temporarily disable isPreview in Parsoid's rendering]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:37:27] subbu all ready for you to check [22:37:52] thanks. I can only test this on wikitech .. and looks like wikimedia-debug extension doesn't apply to that domain. [22:38:07] ah okay, so i'll go ahead and sync then [22:38:14] any other options for verifying there? otherwise, yes .. go ahead. 
[22:38:23] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm [22:38:25] !log jhuneidi@deploy2002 ssastry and jhuneidi: Continuing with sync [22:38:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10RKemper) Been seeing some weirdness on `elastic1107` (internal search team alerts for `PuppetZeroResources` and the like) so we'll see if a fresh reimage smooths things over [22:38:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [22:38:50] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1107.eqiad.wmnet with OS bookworm [22:38:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1... [22:39:11] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [22:39:24] Jdlrobson: cwhite i'll backport your patches together [22:39:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1005.eqiad.wmnet with reason: host reimage [22:40:12] sounds good! 
[22:40:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [22:40:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1004.eqiad.wmnet with OS bullseye [22:40:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore1004.eqiad.wmnet with OS bullseye completed: - sessionstore... [22:42:21] thanks jeena [22:42:34] no testing needed on mine [22:42:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1006.eqiad.wmnet with reason: host reimage [22:44:15] 👍 [22:45:14] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm [22:45:19] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1107.eqiad.wmnet with OS bookworm [22:45:20] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:982835|tests: Use MediaWikiIntegrationTestCase::setGroupPermissions (T353210)]], [[gerrit:982834|Temporarily disable isPreview in Parsoid's rendering]] (duration: 10m 08s) [22:45:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [22:45:26] T353210: CheckUser tests failing for ApiQueryCheckUserLog and SpecialCheckUserLog: "You don't have permission to check users' IP addresses and other information" - https://phabricator.wikimedia.org/T353210 [22:45:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install 
elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm executed with errors: - elastic1... [22:45:46] Jdlrobson: cwhite doing yours now [22:45:56] jeena: all done with mine? [22:46:05] subbu: yup all synced [22:46:47] (CR) TrainBranchBot: [C: +2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982867 (https://phabricator.wikimedia.org/T240685) (owner: Cwhite) [22:46:49] (CR) TrainBranchBot: [C: +2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/982925 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [22:46:54] ty .. and it works. [22:47:05] perfect [22:47:34] (Merged) jenkins-bot: Update wgStatsTarget to port 9125 [mediawiki-config] - https://gerrit.wikimedia.org/r/982867 (https://phabricator.wikimedia.org/T240685) (owner: Cwhite) [22:47:38] (Merged) jenkins-bot: [BC] Enable desktop diff and history pages on mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/982925 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [22:47:56] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm [22:48:01] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:982867|Update wgStatsTarget to port 9125 (T240685)]], [[gerrit:982925|[BC] Enable desktop diff and history pages on mobile (T350181 T353388)]] [22:48:02] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [22:48:08] T240685: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 [22:48:09]
T350181: Enable desktop diff page on mobile site - https://phabricator.wikimedia.org/T350181 [22:48:09] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [22:49:33] !log jhuneidi@deploy2002 jhuneidi and jdlrobson and cwhite: Backport for [[gerrit:982867|Update wgStatsTarget to port 9125 (T240685)]], [[gerrit:982925|[BC] Enable desktop diff and history pages on mobile (T350181 T353388)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:49:54] cwhite: do you need to check anything? [22:50:44] jeena: should be ok to proceed. we would have seen a problem by now if one manifested [22:50:50] okay thanks [22:50:54] !log jhuneidi@deploy2002 jhuneidi and jdlrobson and cwhite: Continuing with sync [22:53:04] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [22:54:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [22:54:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1005.eqiad.wmnet with OS bullseye [22:54:22] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore1005.eqiad.wmnet with OS bullseye completed: - sessionstore...
[22:56:16] (CR) Jforrester: "check experimental" [dumps/dcat] - https://gerrit.wikimedia.org/r/945576 (owner: L10n-bot) [22:56:18] (PS1) Ryan Kemper: elastic: slightly simplify site.pp [puppet] - https://gerrit.wikimedia.org/r/982930 [22:57:12] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [22:57:21] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/982930 (owner: Ryan Kemper) [22:57:43] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:982867|Update wgStatsTarget to port 9125 (T240685)]], [[gerrit:982925|[BC] Enable desktop diff and history pages on mobile (T350181 T353388)]] (duration: 09m 42s) [22:57:51] T240685: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 [22:57:51] T350181: Enable desktop diff page on mobile site - https://phabricator.wikimedia.org/T350181 [22:57:52] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [22:58:19] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [22:58:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1006.eqiad.wmnet with OS bullseye [22:58:26] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore1006.eqiad.wmnet with OS bullseye completed: - sessionstore... [22:58:47] everything looks good from my end. thank you again! [22:58:57] UTC late backport window closed.
Thanks for your patience everyone [22:59:41] RECOVERY - cassandra-a CQL 10.192.32.229:9042 on restbase2032 is OK: TCP OK - 1.065 second response time on 10.192.32.229 port 9042 https://phabricator.wikimedia.org/T93886 [23:02:12] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage [23:02:52] (CR) Bking: [C: +1] elastic: slightly simplify site.pp [puppet] - https://gerrit.wikimedia.org/r/982930 (owner: Ryan Kemper) [23:03:00] (CR) Ryan Kemper: [C: +2] elastic: slightly simplify site.pp [puppet] - https://gerrit.wikimedia.org/r/982930 (owner: Ryan Kemper) [23:05:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage [23:13:24] (PS1) Ryan Kemper: wdqs: decom wdqs10[09-10] [puppet] - https://gerrit.wikimedia.org/r/982933 (https://phabricator.wikimedia.org/T351671) [23:17:36] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new extra plugins - bking@cumin2002 - T353270 [23:17:44] T353270: Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 [23:21:55] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1107.eqiad.wmnet with OS bookworm [23:22:00] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm completed: - elastic1107 (**PASS...
[23:28:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:30:44] (PS1) Jdlrobson: Add wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/982936 (https://phabricator.wikimedia.org/T350181) [23:30:52] (CR) CI reject: [V: -1] Add wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/982936 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [23:31:04] (PS2) Jdlrobson: Add wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/982936 (https://phabricator.wikimedia.org/T350181) [23:33:34] (CR) Jdrewniak: [C: +2] Add wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/982936 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [23:34:16] (Merged) jenkins-bot: Add wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/982936 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [23:39:36] (CR) Dwisehaupt: [V: +2 C: +2] "Also got the verbal ok from jgreen but he had to head off for the night." [labs/private] - https://gerrit.wikimedia.org/r/982924 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt) [23:42:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bullseye [23:44:56] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (odimitrijevic) Thank you @elukey!
[23:48:00] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host acmechief1002.eqiad.wmnet with OS bookworm [23:48:00] !log brett@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host acmechief1002.eqiad.wmnet [23:48:05] SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T353295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host acmechief1002.eqiad.wmnet with OS bookworm executed with errors: - ac...