[08:47:03] dcaro: that k8s message rings a bell. I have seen that before, but I don't think it was indicative of anything wrong happening
[10:30:45] topranks: hey, do you prefer doing the gnmi thing here or on meet?
[10:31:57] hey! meet is probably best maybe?
[10:32:08] can we push it back by 15 min though? if that works for you?
[10:32:27] I'm just getting my machine set back up properly after the trip (only got back last night)
[10:32:51] sure, I'm in https://meet.google.com/aba-ybqy-ngi
[10:33:05] whenever you are ready
[10:33:09] ok
[10:33:11] thanks <3
[11:29:42] dcaro: you mentioned "released components-cli 0.2.0 on tools" in #-daily, but I don't see that version in gitlab, am I looking in the wrong place?
[11:38:22] let me check
[11:38:52] https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/releases/debian%2F0.0.2
[11:39:30] * dcaro typing error generato
[11:39:35] *generator
[11:39:38] the version was 0.0.2
[11:42:15] ah ok! :)
[11:46:59] arturo: can you reassure me that cloudgw1001 is inactive before I start the decom?
[11:47:14] yes, let me check
[11:48:21] ty
[11:49:02] andrewbogott: cloudgw1004 is primary, cloudgw1001 is secondary, wmcs.openstack.network.tests reports all OK
[11:49:23] great! I'm starting the reimages and then having breakfast
[11:49:47] ok, don't forget to merge the puppet patch before the reimage!
[11:50:15] Already done :)
[11:52:49] 👍
[12:26:48] partman failed on 1003 because of course
[12:36:50] worked the second time for some reason
[12:43:24] standard behavior yep xd
[12:44:13] :-(
[12:49:31] andrewbogott: did you do the netbox DNS changes?
[12:49:41] I haven't touched netbox
[12:49:51] https://phabricator.wikimedia.org/T382356#10520717
[12:50:02] I'll do them
[12:50:06] thx
[12:50:14] 1003 just started its initial puppet run
[12:51:02] ok
[13:01:42] andrewbogott: how is that going?
[13:01:58] It's done with the initial puppet run, wrapping up
[13:06:44] ok
[13:09:14] arturo: ready!
[13:09:51] ok
[13:10:58] let me clear the NXDOMAIN for vlan1120.cloudgw1003.eqiad1.wikimediacloud.org
[13:12:38] ok, the wmcs.openstack.network.tests cookbook is back to green
[13:13:08] great. time to fail over?
[13:13:24] yes!
[13:13:42] * andrewbogott crosses fingers, knocks wood, etc.
[13:13:52] you do it? or I do?
[13:13:56] go ahead
[13:14:53] done
[13:15:08] seems fine!
[13:15:16] stashbot, still here?
[13:15:30] hmmmm
[13:16:03] -_-
[13:16:37] I think we expected a brief bot interruption?
[13:16:46] stashbot, I can see you in the sidebar
[13:17:03] yeah, unfortunately, always happens
[13:17:15] do I need to restart it or will it do that on its own?
[13:17:24] on its own
[13:17:28] ok
[13:17:30] yeah, nothing to do
[13:17:45] do you have an opinion about whether I decom 100[12] now or later?
[13:18:34] I don't have an opinion. I don't think we will ever need them, they are faulty hardware in my opinion
[13:18:47] * arturo apparently has an opinion then
[13:19:16] ok, I'll start decom
[13:19:30] ok!
[13:26:06] ...should I be worried about stashbot yet?
[13:26:49] just came back
[13:27:11] oops just in time for me to restart it
[13:27:21] stashbot, working now?
[13:27:21] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[13:27:26] * andrewbogott shrugs
[13:33:19] thanks arturo! Going to close some phab tasks now :)
[13:33:47] andrewbogott: I see there are new per-HV CPU panels on https://grafana-rw.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary is it possible they are making the dashboard slow to load?
[13:34:37] I'm sure they make it slower although it still loads for me in a tolerable amount of time
[13:35:02] you can remove them if they're too much trouble; I added them when I was adjusting the overprovision ratio
[13:35:37] collapsing them into a section is also a way to prevent them from loading by default
[13:41:07] maybe a dedicated dashboard would be better
[13:41:38] probably!
For now I think it's fine to just remove them if they're in your way.
[13:41:43] want me to do that?
[13:59:22] nah!
[13:59:25] no problem
[14:42:33] if nobody objects, I'm gonna merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113194 cc Raymond_Ndibe
[14:43:00] it should be the last thing before we can close T362867
[14:43:01] T362867: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 - https://phabricator.wikimedia.org/T362867
[14:53:38] dhinus: LGTM
[15:03:45] dhinus: lgtm, would be really nice if we could figure out how to keep kubeadm from creating that file, or force it to create it empty, but well
[15:04:03] thanks, I'm merging now
[16:00:30] Should I assume that since the cookbook is called 'wmcs.openstack.quota_increase' we don't have a cookbook for decreasing quotas?
[16:00:58] I don't think we have one
[16:01:50] ok!
[16:02:39] I was planning on adding quotas support in tofu-infra
[16:03:20] andrewbogott: it might support using negative numbers
[16:03:43] huh, I'll try
[16:04:12] would be nice to add a note in the help at least though xd
[16:04:15] (if it does)
[16:05:09] "Increased quotas by -16 cores"
[16:05:30] seems to have worked
[16:06:23] hahaha
[16:06:25] nice
[16:07:29] doesn't work for ram though, says 'expected one argument'
[16:25:09] dhinus: can I assign you T386467? I'm assuming you've done that most recently
[16:25:10] T386467: [wikireplicas] Create views for new wiki sylwiki - https://phabricator.wikimedia.org/T386467
[16:27:13] sure, I'm doing all of those at the moment
[16:27:40] though I would like to make that a clinic duty thing //eventually// :)
[16:28:12] Speaking as the one on clinic duty right now: thank you!
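The "Increased quotas by -16 cores" success versus the 'expected one argument' failure for ram matches a well-known Python argparse quirk: a bare negative integer like `-16` is recognised as a value, but a token like `-16G` (digits plus a unit suffix) looks like an unknown option, so the preceding flag is left without an argument. A minimal sketch under that assumption (the parser and option names here are hypothetical, not the actual wmcs.openstack.quota_increase code):

```python
import argparse

# Hypothetical options mirroring a quota cookbook (names assumed).
parser = argparse.ArgumentParser()
parser.add_argument("--cores", type=int)
parser.add_argument("--ram", type=str)  # e.g. "16G"

# A plain negative integer parses fine: argparse's negative-number
# matcher treats "-16" as a value rather than an option.
ok = parser.parse_args(["--cores", "-16"])
print(ok.cores)  # -16

# "-16G" does not look like a number, so argparse treats it as an
# unknown option and --ram gets no value:
#   error: argument --ram: expected one argument
try:
    parser.parse_args(["--ram", "-16G"])
except SystemExit:
    print("parse failed for --ram -16G")

# Workaround: attach the value with "=" so it cannot be read as a flag.
ok2 = parser.parse_args(["--ram=-16G"])
print(ok2.ram)  # -16G
```

If this is indeed the cause, the `--ram=-16G` form would be the note worth adding to the cookbook's help text.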
[16:28:37] *cough* or maybe offload to other folks *cough*
[16:28:48] :D
[16:29:13] yeah, come to think of it I think we're only responsible for keeping the service up, not for maintenance actions
[16:29:16] in theory
[16:29:50] but until I can remember the name of the team that is responsible, I probably don't get to complain!
[16:30:05] arturo andrewbogott please add your comments to T382607 which is exactly about that :)
[16:30:06] T382607: Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607
[16:34:14] dhinus: I will write a comment tomorrow, but I wonder whether others would buy into whatever comes out of that ticket?
[16:35:43] well, I was thinking about that, I think the only tricky one is option 3 where data-platform-sre would need to do more work
[16:35:54] nevermind, I now read option 3 as "we ask ...", so that's fair
[16:38:10] different topic, I'm trying to understand why the KernelErrors alert fired on cloudgw1003 even with all the filters
[16:39:18] e.g. it detected a "priority_crit" message, but the only one I can find should be filtered by kernel-messages-ignore-regex.txt
[16:39:40] dhinus: in the past I've manually run the script with `bash -x` to see some additional debug information
[16:40:04] maybe a race condition between the first run and the creation of kernel-messages-ignore-regex.txt ?
[16:40:32] nah, they most likely are deployed in the same puppet run
[16:40:44] I tried re-running the script now, with an extended timeline of 30 days instead of 30 mins
[16:40:58] priority_crit and priority_err are both 0
[16:41:38] https://www.irccloud.com/pastebin/SSv5dUOv/
[16:42:03] https://www.irccloud.com/pastebin/KmTZodP5/
[16:42:21] this may be it?
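The suspected failure mode above — the check running before kernel-messages-ignore-regex.txt exists with its patterns — can be sketched as follows. This is a simplified stand-in for the real check script (function name and behaviour assumed), just to show why a missing or empty ignore file makes every "priority_crit" line count:

```python
import re
from pathlib import Path

def count_unignored(messages, ignore_file):
    """Count kernel messages that survive the ignore list.

    Sketch only: if the ignore file has not been deployed yet (or was
    written empty), no pattern filters anything and every message is
    reported -- the race condition discussed above.
    """
    path = Path(ignore_file)
    patterns = []
    if path.exists():
        patterns = [re.compile(line)
                    for line in path.read_text().splitlines() if line.strip()]
    return sum(1 for msg in messages
               if not any(p.search(msg) for p in patterns))

# Hypothetical sample messages, not real cloudgw1003 log lines.
messages = [
    "kernel: CPU7: Core temperature above threshold",
    "kernel: something genuinely bad",
]

ignore = Path("kernel-messages-ignore-regex.txt")
ignore.write_text("Core temperature above threshold\n")
print(count_unignored(messages, ignore))  # 1: only the real error remains

ignore.write_text("")  # as if Puppet had not populated the file yet
print(count_unignored(messages, ignore))  # 2: nothing is filtered
```

With logic like this, the safe variants are to treat a missing/empty ignore file as "skip the check" or to guarantee the file is written before the first run.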
[16:42:31] * arturo needs to go offline now
[16:42:43] arturo: np, we can catch up tomorrow
[16:42:49] I might open a task
[16:43:09] 👍 thanks
[17:12:43] there was indeed a race condition, T386850
[17:12:44] T386850: [monitoring] KernelErrors alerts trigger incorrectly when a host is reimaged - https://phabricator.wikimedia.org/T386850
[17:13:11] or more accurately, a glitch in how the ignore-regex file is created by Puppet
[17:30:40] I'm planning to update k8s on toolsbeta tomorrow or Friday (T362868), and I might break things... is anyone actively using toolsbeta this week? cc Raymond_Ndibe Rook
[17:30:45] T362868: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868
[17:31:50] Safe from my perspective
[17:52:04] * dcaro off
[17:52:08] cya tomorrow!
[20:06:33] dhinus: good find re: kernel errors puppet race