[07:39:48] hello folks, I am powercycling parse1012, there was a cpu error in the racadm getsel
[07:41:08] also depooled it
[07:43:21] I don't see anything weird for the moment but I'll leave the pool action to you (in case you want to double check)
[09:08:09] elukey: thanks <3
[10:00:45] folks if nobody opposes I'd merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886862 and roll it out
[10:01:06] it seems really harmless and causing a ton of spam
[10:01:19] (also pods are trying and failing to use non-kafka nodes)
[10:04:17] fine by me
[10:05:15] ok proceeding
[10:16:27] regarding parse1012, SOP for this issue (according to dell anyways) is to update the firmware, clear the log, and see if it happens again
[10:42:49] change to eventgate-logging-external rolled out, all good afaics
[10:45:39] jayme: opinion on what to do with parse1012?
[10:47:40] claime: if you want to be extra sure, we can open a task to the ops-eqiad folks to upgrade the firmware, or we can repool and keep watching it (if it re-happens soon we can depool it again and cut a task to dcops, likely to follow up with dell)
[10:48:05] elukey: Isn't there an upgrade-firmware cookbook we can run ourselves?
[10:48:20] ah really?
[10:48:23] I think so
[10:48:26] Let me check
[10:48:53] cgoubert@cumin1001:~/cookbooks$ sudo cookbook -l | grep firm
[10:48:55] | `-- sre.hardware.upgrade-firmware
[10:48:57] yup
[10:49:22] very nice, then I think we can try it
[10:49:30] let's check previous usages of it in phab
[10:49:33] just to be sure
[10:49:36] yep
[10:50:21] claime: sorry, I have zero context currently
[10:50:38] jayme: no problem.
[10:52:48] what I'd do is the following - repool the node and keep it monitored, these issues may happen from time to time. If it re-happens, we can cut a task to ops-eqiad and ask to them what is the best option
[10:53:11] I am worried about randomly upgrading firmwares without them knowing
[10:53:20] Understandably
[10:53:45] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Kubernetes v1.23 multi master setup is broken - https://phabricator.wikimedia.org/T329826 (JMeybohm)
[10:53:55] Let's do that then, I'll repool it
[10:54:14] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Kubernetes v1.23 multi master setup is broken - https://phabricator.wikimedia.org/T329826 (JMeybohm) p:Triage→High
[10:54:28] super, I rechecked getsel on idrac and nothing new popped up
[11:05:25] serviceops, Kubernetes: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (JMeybohm) p:Triage→High
[11:19:10] pro-tip: cookbook -lv gives you also a one-line description of the cookbook :)
[11:19:39] thanks :D
[11:29:16] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Patch-For-Review: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Clement_Goubert) The above patch removes `x2` from the core databases, and removes the now unu...
[11:40:14] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Patch-For-Review: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Ladsgroup) Yes, that's the way we should do it given Manuel's comment above and my basic under...
[11:47:27] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (CDanis)
[11:49:25] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Patch-For-Review: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Clement_Goubert) Thanks @Ladsgroup once the spicerack release is done I'll test the cookbook p...
[11:54:16] serviceops, Data-Persistence, Toolhub, Datacenter-Switchover: What should happen to Toolhub during the 2023 DC switch? - https://phabricator.wikimedia.org/T329319 (Clement_Goubert) That seems good to me, as long as you're ok with the downtimes and maintenances.
[16:39:22] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Kubernetes v1.23 multi master setup is broken - https://phabricator.wikimedia.org/T329826 (JMeybohm)
[17:29:31] serviceops, Data-Persistence, Toolhub, Datacenter-Switchover: What should happen to Toolhub during the 2023 DC switch? - https://phabricator.wikimedia.org/T329319 (bd808) >>! In T329319#8621299, @Clement_Goubert wrote: > That seems good to me, as long as you're ok with the downtimes and maintenan...
[22:49:46] serviceops, SRE, Traffic: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (BCornwall) Envoy seems to be on 1.18.2 now. Can this be closed, or was there any other deployment need this ticket addresses?