[08:42:15] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=39a549ae-98fb-4aef-878d-0821f2d1ea4b) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services...
[09:38:50] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9b28fccf-ebb0-4701-b5cb-3d157b3ca2b0) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services...
[09:38:59] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=116c5a1a-2682-42d9-b281-94b33ec2e23c) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services...
[10:03:18] serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, and 2 others: Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (daniel) >>! In T345185#9448367, @EBernhardson wrote: > Thoughts? That sounds like a reasonable...
[10:04:36] serviceops, Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testreduce1001.eqiad.wmnet` - testreduce1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertma...
[10:33:17] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:36:02] serviceops, Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff testreduce1002 is now working fine and testreduce1001 has been decommissioned, closing.
[10:43:09] jelto: the gerrit config change can be merged any time ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/987120 )
[10:43:32] I'd just have to restart Gerrit to have it taken into account, but there is no rush for it ;)
[10:43:48] ack, I'll merge the change now
[10:48:20] hashar: patch merged and puppet updated the gerrit.config file
[10:48:34] perfect thank you jelto !
[10:49:06] I will check tomorrow morning when I restart gerrit
[10:50:06] great thanks
[12:40:31] serviceops, RESTBase Sunsetting, Parsoid (Tracking): RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867 (Clement_Goubert)
[14:23:00] trying to deploy ipoid update to staging, and seeing `UPGRADE FAILED: release staging failed, and has been rolled back due to atomic being set: cannot patch "ipoid-staging-testing-crons" with kind CronJob: CronJob.batch "ipoid-staging-testing-crons" is invalid`
[14:23:09] cc effie
[14:25:33] applying to eqiad also fails
[14:27:12] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2284900-4b6b-4cc1-aba3-ee88a4fb1e3e) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services...
[14:27:39] serviceops, iPoid-Service: Unable to deploy ipoid to staging or eqiad - https://phabricator.wikimedia.org/T354768 (kostajh) p:Triage→High
[14:38:30] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1380.eqiad.wmnet with OS bullseye
[14:40:01] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=32e531db-6a10-4d67-adb6-cb3288c935b2) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services...
[14:45:33] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:55:31] kostajh: (copying from slack) we are looking into it
[14:55:58] ty
[15:15:38] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (akosiaris) > Appears to be related to the wdat_wdt watchdog driver (all affected hosts have that driver). Kernels >= 5.10.205-1 should have a related patch backported (https...
[15:17:42] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1380.eqiad.wmnet with OS bullseye completed: - mw1380 (**PASS**) - Downtime...
[15:20:02] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1381.eqiad.wmnet with OS bullseye
[15:20:42] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1382.eqiad.wmnet with OS bullseye
[15:21:18] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1383.eqiad.wmnet with OS bullseye
[15:22:01] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (MoritzMuehlenhoff) >>! In T354413#9450608, @akosiaris wrote: >> Appears to be related to the wdat_wdt watchdog driver (all affected hosts have that driver). Kernels >= 5.10....
[15:23:50] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (akosiaris) >>! In T354413#9450641, @MoritzMuehlenhoff wrote: >>>! In T354413#9450608, @akosiaris wrote: >>> Appears to be related to the wdat_wdt watchdog driver (all affect...
[15:31:20] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:41:36] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1379.eqiad.wmnet with OS bullseye
[15:48:30] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:55:04] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Clement_Goubert) Summary of the discussion on the linked CR: - LLDP based logic runs the ri...
[15:57:38] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (JMeybohm)
[15:57:51] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1382.eqiad.wmnet with OS bullseye completed: - mw1382 (**WARN**) - Downtimed on I...
[15:59:15] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1381.eqiad.wmnet with OS bullseye completed: - mw1381 (**PASS**) - Downtime...
[16:00:18] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (akosiaris) >>! In T352893#9450792, @Clement_Goubert wrote: > Summary of the discussion on t...
[16:00:26] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1383.eqiad.wmnet with OS bullseye completed: - mw1383 (**PASS**) - Downtimed on I...
[16:06:57] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) For reference, here's a screenshot of more kafka metrics around enabling compaction: {F41...
[16:09:25] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9450792, @Clement_Goubert wrote: > I am left wondering if the fear...
[16:22:27] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1379.eqiad.wmnet with OS bullseye completed: - mw1379 (**PASS**) - Downtimed on I...
[16:25:55] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe00207a-5a06-4fc6-a98d-1de1261b924f) set by kamila@cumin1002 for 1:00:00 on 5 host(s) and their services wi...
[16:28:47] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Volans) I might be missing context, but why we can't get that info from netbox? Extracting...
[16:32:24] serviceops, iPoid-Service: Unable to deploy ipoid to staging or eqiad - https://phabricator.wikimedia.org/T354768 (jijiki) We are working on updating the underlying template which is causing the errors you are seeing, I will let you know when it is properly fixed. Sorry for the inconvenience!
[16:37:26] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1378.eqiad.wmnet with OS bullseye
[16:46:13] serviceops, Release-Engineering-Team (Radar): Allow release engineering to delete images - https://phabricator.wikimedia.org/T354786 (Clement_Goubert)
[17:13:40] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9450929, @Volans wrote: > I might be missing context, but why we ca...
[17:15:44] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1378.eqiad.wmnet with OS bullseye completed: - mw1378 (**PASS**) - Downtime...
[17:16:29] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye
[17:54:23] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye completed: - mw1377 (**PASS**) - Downtimed on Icinga/Alertmanag...
[17:55:23] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[18:39:44] serviceops, CirrusSearch, Discovery-Search, Data-Platform-SRE (2023/24 Q3 Milestone 1): Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)
[18:48:44] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (kamila)
[18:49:03] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) Open→Resolved With Alex's patches, I ran 7 reimages and 20 reboots without the issue reappearing. It might be worthwhile to understand the issue better to see if that workaround i...
[19:16:02] serviceops, CirrusSearch, Discovery-Search, Data-Platform-SRE (2023/24 Q3 Milestone 1): Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)