[21:01:20] !log admin `sudo service maintain-dbusers restart` on cloudcontrol1005. Report of missing replica.my.cnf and journalctl output empty due to log rotation. (T382962)
[21:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[21:01:26] T382962: Missing replica.my.cnf for freshly created Toolforge account vehicle-keeper-markings - https://phabricator.wikimedia.org/T382962
[21:09:33] since you're around, bd808, would you drain a k8s node for T382863?
[21:09:34] T382863: cfdw-28928147-9qtjx stuck in Terminating state - https://phabricator.wikimedia.org/T382863
[21:11:00] * lucaswerkmeister looks at the docs
[21:15:43] If you get stuck lucaswerkmeister I think a.ndrewbogott is still about today. There should be a cookbook that does the needful.
[21:15:54] yeah I found https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Drain_and_undrain_a_node and the host to SSH into
[21:16:00] trying to figure out what I would need to do afterwards
[21:16:22] I assume it’s drain (cookbook 1), reboot (TODO), uncordon (cookbook 2)?
[21:16:49] I was also able to SSH into the nfs-69 node (nice) and there are indeed some processes stuck in D there
[21:17:08] reboot via horizon?
[21:17:14] yeah that sounds right. the reboot can be by ssh into the box or with the horizon web ui
[21:17:32] https://sal.toolforge.org/tools also shows a reboot cookbook 🤔
[21:17:54] the cool kids have lots of cookbooks :)
[21:17:57] ^^
[21:18:05] I’ll give that a try and hope enough people are around in case I mess up ;)
[21:18:09] (thanks!)
[21:18:14] I'm running around with rocks and sticks from the olden days
[21:18:29] * lucaswerkmeister wasn’t going to comment on `service restart` vs. `systemctl restart`
[21:19:46] get off my lawn!
[21:20:02] :omya_systemd:
[21:20:52] sudo: cookbook: command not found
[21:20:55] guess I’m not on the right host then
[21:21:11] (tools-cumin-1.tools.eqiad1.wikimedia.cloud)
[21:22:39] huh. I would have thought that was the place too. andrewbogott do you know where to run a cookbook from as a Toolforge root?
[21:22:57] cloudcumin1001?
[21:23:05] I wonder if you have to be on a cloudcumin node?
[21:23:13] based on SAL
[21:24:19] In theory you can setup your laptop to run them locally, but I've never done that myself.
[21:27:57] hm, not sure I have access to cloudcumin1001.eqiad.wmnet
[21:28:44] at least my usual SSH setup can’t even resolve it and -J bastion-cloud hangs – it feels like a host name that needs production shell access, which volunteer Lucas doesn’t have
[21:31:33] commented at https://phabricator.wikimedia.org/T382863#10429998
[21:31:49] lucaswerkmeister: yeah, I don't see you in /etc/passwd there. I think that host is wmcs-roots only
[21:33:12] alright, then that’s just beyond my powers to fix and that’s fine :) thanks for walking me through it anyway!
[21:54:56] thanks
[21:55:22] lucaswerkmeister: {{done}} and I learned that wmcs.toolforge.k8s.reboot is really all we needed. It does the whole dance of cordon, drain, reboot, and uncordon apparently. That cookbook can be used to do a rolling restart across the whole cluster if needed.
[21:55:27] yw JJMC89
[21:57:21] lucaswerkmeister: toolforge roots really should be able to run those cookbooks. it might be worth opening a phab task to see if tools-cumin-1 can be fixed up.
[21:57:56] also, I would be happy to have you as a wmcs-root if you are interested in that bigger hat
[22:02:28] oh nice
[22:02:40] I’ll open a task for the former, not sure about the latter yet ^^
[22:06:19] filed T382977
[22:06:20] T382977: Allow Toolforge roots to reboot k8s worker nodes (without wmcs-root) - https://phabricator.wikimedia.org/T382977
[22:10:15] sorry, catching up...
[22:10:49] andrewbogott: its all good. T382977 is the thing that might be fixable.
[22:11:25] I see!
[22:11:54] So in theory a toolforge root can log in to the k8s controller and do all the things. But yeah, there's not easy access to the cookbook.
[22:12:08] I think that ask is worth fixing but don't immediately know the right way to do it
[22:12:11] *task
[22:12:29] yeah, and there is a tools-cumin-1 node but it seems not to have cookbooks
[22:13:02] hm, maybe this is simple then
[22:13:59] (JFTR, I found tools-cumin-1 by looking for nodes with “cumin” in the name in openstack browser, that’s literally all the indication I have that this looked like a node that might possibly be the right one ^^)
[22:14:17] well, it's simple /if/ the cookbooks are split into 'need novaadmin' and "don't need novaadmin" and "act on idrac" :/
[22:14:39] that reboot one might actually use novaadmin
[22:15:24] actually it must
[22:15:36] it /could/ shell into the VM and reboot it there
[22:15:38] but I bet it doesn't
[22:16:40] yeah, if it shelled in then it would also have to poll for the host to come back up. the openstack api would make that less awkward to code
[22:17:06] * bd808 gives everyone prod root and solves all the problems
[22:18:01] I suspect that this was all gamed out for volunteer wmcs roots and not at all for toolforge roots. But I pinged francesco on the task in case he already has a plan for this.
[22:19:12] Another fix might be making the local install instructions more discoverable I guess
[22:19:50] does that help with the novaadmin thing?
[22:20:16] last time I tried local cookbook install was kind of a pain
[22:21:14] I'm not sure actually. I don't know if local most uses your personal creds or bounces through to shared creds somewhere. This is all basically a mystery to those of us who haven't built out the system.
[22:21:20] *host
[22:21:32] tbh a few of the hops are a mystery to me as well
[22:24:00] I guess the other question is: how often does/will this come up? lucaswerkmeister was this a scenario where you specifically needed a safe reboot of the node, or would targetted killing of the one job have been enough?
[22:24:43] it appeared to be the hung job state that only clears with a node reboot (aka NFS hang)
[22:25:48] it is a thing that can wait to happen later as long as we have more exec nodes that are still working as expected
[22:27:07] ok, but does ultimately require a reboot. Which is pretty annoying to do without a cookbook.
[22:28:51] lucaswerkmeister: sorry if our infra invited you to help, and then stopped you from helping :(
[22:30:48] andrewbogott: it wasn’t a scenario where I specifically needed anything, I just saw in here that it seemed to need a reboot by someone™, I had some free time, and wondered if I could do it
[22:30:57] so the task isn’t urgent as far as I’m concerned :)