[00:10:40] andrewbogott: :(( I've had a lot of dev environment rot lately myself.
[00:44:01] * bd808 off
[06:53:46] * andrewbogott throws up hands and stomps off to bed
[10:33:57] morning! still not fully back body clock wise but getting there :D
[10:42:06] same here :D
[10:42:21] this is awesome, we created the toolforge-deploy upgrade branch already :)
[10:42:24] https://usercontent.irccloud-cdn.com/file/36ymDbg7/image.png
[10:42:49] and it's assigned to me xd https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/154
[10:43:47] and it has the bug in the commit \o/ awesome
[10:49:54] cool
[11:04:10] * dcaro lunch
[16:19:45] the sge 'task' queue spiked quite heavily at around 18:00
[16:19:54] dcaro: the cron/mtime alert fired briefly but then recovered.
[16:20:11] so I think we're good for the moment. Thank you, if you fixed it :)
[16:20:16] it self-fixed no? or someone fixed it
[16:20:26] i've been scaling down the grid quite aggressively over the last day or two, I wonder if I removed a node or two too much
[16:24:21] the queue is going down, I'd add one node back just in case
[16:56:18] * dcaro off
[17:32:03] taavi: right now I'm seeing failures with contacting the grid -- but I'm not sure if the grid is misbehaving or if I'm just doing something wrong.
[17:32:06] https://www.irccloud.com/pastebin/9NkRE2aH/
[17:32:12] ^ should work, shouldn't it?
[17:33:08] same behavior on tools-sgegrid-master
[17:34:27] Oh, probably this is the exact issue you're currently working on
[17:35:18] probably.. the add instance cookbook seems a bit broken on cloudcumins and it added a new node before it was provisioned, I'm currently waiting for puppet to do its thing and fix that instance
[17:36:10] ok! Sorry for the ping, I'll step back
[17:44:29] andrewbogott: it seems like tools-sgeexec-10-23 is having trouble reading or writing anything in the .system_sge folder on NFS - any ideas why?
[17:45:22] No! I checked for r/w but having it specific to one dir is weird
[17:45:39] what's the full path?
[17:46:27] actually right now I can't even 'ls /data/project'
[17:46:29] /data/project/.system_sge/gridengine/default/common/bootstrap is the specific file it tries to write
[17:46:46] that would be a problem that'd explain many things
[17:48:43] it was stuck and now is unstuck...
[17:48:52] and now stuck again
[17:49:51] nfs cpu usage is quite variable on that host
[17:49:57] the nfs server I mean
[17:50:04] I'm reluctant to reboot it because that's such a pain...
[17:50:15] it might just be under excessive load
[17:51:37] iftop suggests something on tools-k8s-worker-39
[17:52:28] Hm... if we reboot that worker node it might get us some relief although likely whatever it is will just start up again
[17:55:46] might have been listeria. I restarted that webservice
[17:57:12] Happen to know why OS-EXT-SRV-ATTR:user_data is so enormous for that host? I haven't seen that before but maybe that's the new normal for worker nodes
[17:57:16] * andrewbogott checks a different one
[17:58:20] yes for other worker nodes but no for other new VMs
[17:58:28] taavi: is that on purpose/something you know about?
[17:58:39] nfs is still struggling
[17:58:45] on which host?
[18:00:41] ^ is that a question about nfs or about the userdata thing?
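A minimal sketch of how one might dump and decode the user_data field being asked about here; the instance name comes from the log, but the exact field name in the `openstack server show` JSON output and the availability of `jq` on the client host are assumptions.

    # Pull the raw user_data for the suspect VM and decode it; it is base64-encoded
    # cloud-init data, so the first few decoded lines show what was injected at creation.
    VM=tools-k8s-worker-39
    openstack server show "$VM" -f json \
        | jq -r '."OS-EXT-SRV-ATTR:user_data"' \
        | base64 -d \
        | head -n 20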
[18:01:26] the userdata thing
[18:02:41] tools-k8s-worker-39
[18:02:43] but also others
[18:03:18] It may be expected, I'm just surprised at the 'openstack server show' output which is extremely wide
[18:04:05] decoding that base64 it looks like that's just how cloud-init vendordata was delivered when that vm was created years ago
[18:04:25] That seems likely, I'll disregard.
[18:04:33] nfs is still running at 250% cpu and my grid queries are timing out.
[18:04:39] I can't think why this would have changed...
[18:04:47] unless there are new nodes that aren't throttled properly
[18:04:59] Or if the nfs server is just having a fit of some sort
[18:05:10] * andrewbogott digs into the throttling question
[18:06:01] I don't think traffic shaping is set up properly on `tools-k8s-worker-39`, various `tc list` subcommands are empty
[18:09:20] I'm hunting down the puppet code that should manage that, can you see if there's shaping on other toolforge nodes?
[18:10:44] oh no i was just looking incorrectly, it does show some rules if you specify the interface
[18:12:22] ok
[18:15:01] i killed the possibly problematic pod, let's see if the problem moves to a new node with it
[18:15:33] ok.
[18:15:49] There's not a backup running on the nfs volume currently
[18:16:59] load is dropping, for the moment at least
[18:18:31] and still
[18:20:19] I suspect something is still holding a lock on /data/project/.system_sge/gridengine/default/common/bootstrap, tools-sgeexec-10-23 is able to read it but not write, which is a problem
[18:21:05] yeah, the nfs server is happier but I still can't qstat
[18:21:50] Something may have timed out without releasing the lock
[18:21:50] i'll boot the sge master
[18:22:00] that's my best guess as well
[18:26:15] doesn't seem to have helped
[18:26:49] rebooting the shadow too
[18:27:28] bd808 do you have time to look at my new little stop-the-grid-for-a-tool scripts? (Of course I haven't tested them yet because... broken grid)
[18:28:32] taavi: any idea how locking works there? If it's a file we might just need to rm it
[18:29:13] andrewbogott: no clue, or how that file is provisioned
[18:29:25] ok
[18:29:52] I'll at least look at what the nfs server thinks about that file
[18:33:40] trying to delete the file on the nfs server doesn't work either
[18:36:50] that's interesting :(
[18:36:59] So are we converging on a reboot of the nfs server?
[18:37:14] It's annoying but it wouldn't be the first time
[18:38:00] I'm out of ideas, so yes unless you have some
[18:38:08] nope.
[18:38:14] I'll try a soft reboot and get ready for a hard one.
[18:40:26] I'm starting a full k8s cluster reboot, doing that now is easier than hunting down the few tools that'll have an issue with that
[18:40:46] nfs server is back and that file isn't frozen anymore
[18:41:04] but I would not say that the grid is especially working.
[18:41:11] Reboot grid master and shadow again?
[18:41:31] sure, you want to do it or should I?
[18:41:42] actually, I may be mistaking the issue, one second...
[18:41:48] qstat seems to work now
[18:42:09] yep, agreed, it was just the exec node I was trying that was locked up
[18:42:19] OK, so... let me look for sge nodes that are messed up now...
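A minimal sketch of the interface-specific `tc` check mentioned above: the shaping rules only show up once a device is named. The interface name `ens3` is a guess; substitute whatever `ip -o link` reports on the worker node.

    # Show the qdiscs, classes and filters that implement the traffic shaping
    # on a given interface (the bare `tc ... show` subcommands may print nothing).
    IFACE=ens3   # assumption: confirm the real name with `ip -o link`
    tc qdisc show dev "$IFACE"
    tc class show dev "$IFACE"
    tc filter show dev "$IFACE"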
[18:43:22] running sudo cumin -t 30 'O{project:tools}' "ls /mnt/nfs/labstore-secondary-tools-project"
[18:43:51] timeouts are: tools-k8s-worker-[30-32,35,38,40-41,45-48,50,52,54-56,59-60,64-65,67,69-72,76-80,82-83,85-92,94,98,100].tools.eqiad1.wikimedia.cloud,tools-sgeexec-10-[15,17].tools.eqiad1.wikimedia.cloud,tools-sgeweblight-10-[17,21,30,32].tools.eqiad1.wikimedia.cloud
[18:44:05] I'm ignoring the k8s nodes in that list, rebooting the others
[18:46:46] done, no more timeouts outside of k8s
[18:48:07] I guess we should've started with the nfs reboot
[18:48:47] taavi: want to send a status email or shall I?
[18:48:51] go for it
[18:49:26] Just grid outage, right? No k8s breakage that we know of?
[18:50:26] i suspect many k8s nodes will have issues now that we rebooted the nfs server. those will get fixed with the reboots, but that'll take an hour or two to complete
[18:59:24] taavi: before you catch your breath, have time for a mostly trivial code review? (regarding the komla toolkit for killing grid jobs)
[18:59:36] I'm asking because of the 'only one reviewer per pr' thing
[18:59:50] sure, but just a moment
[19:00:39] thanks, https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/1
[19:04:25] added some comments
[19:15:57] thanks. I'm concerned about the crontab thing since that mv is what we've been relying on all along
[19:24:08] taavi: won't the quota change prevent webservice from being able to start anything even if it tries?
[19:30:14] andrewbogott: catching up on backscroll. I can make time to look at your stuff today, yes.
[19:30:27] right now I need to feed myself though :)
[19:30:31] * bd808 lunch
[20:33:34] bd808: the pr in question is https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/1
[21:50:29] andrewbogott: I left some comments. I think you are very close, and I was glad to see parts in there that my brain had forgotten (the grid quota stuff you mentioned)
[21:51:26] Thanks! Are your comments beginning with 'My recollection of today's discussion' agreeing with me that it should require admin intervention to re-enable?
[21:52:49] no, it's actually agreeing with Taavi I think, but I don't want to battle about it as long as every tools admin knows how to re-enable.
[21:53:21] hmmm... does every tools admin have access to that script or will it end up in a place that needs cloud root?
[21:54:21] I don't think I have an opinion, just trying to understand. If we don't require admin intervention to restart the tool how will we get a note on the phab task?
[21:54:41] Are y'all imagining some kind of more self-serve thing where there's a wrapper script that does things?
[21:55:45] bd808: My vision for this workflow is that someone (komla) would literally log on to the two hosts and run the scripts (or do so with cumin). So anyone with root on those two VMs can do all the things.
[21:55:53] I'm trying to think through the conflicting goals of tracking things on Phab and not requiring some of us to idle in the irc channel and other comms channels for the end of year break
[21:56:15] Ah, I see.
[21:56:26] I find it unlikely that komla will suddenly start helping on irc etc
[21:56:35] I guess at the very least we should drop a "why is my tool broken" readme
[21:57:14] make it tell them balloons's phone number so they can page him ;)
[21:57:56] Perfect!
I'm already listed as the global holiday on call 🙂
[21:58:35] But yes, we want to make sure that people understand what's going on when they respond and that they're directed to phabricator
[22:00:25] balloons: if you would talk this out with andrewbogott in the context of his new script that would be great -- https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/1
[22:00:54] I don't really care too much other than not wanting the world to be blocked on us showing up if possible
[22:03:36] https://www.irccloud.com/pastebin/jB2cycm6/
[22:08:07] > it will be a subtask of https://phabricator.wikimedia.org/T314664
[22:08:07] should link to the board instead
[22:08:20] ok, have that link handy?
[22:08:35] and yes, I agree "who will be around on the holidays to run the script" is quite important here
[22:08:36] https://phabricator.wikimedia.org/project/view/6135/
[22:11:35] I will mostly be around until the 29th, but also my opinion is the possibly-unpopular "If you ignored our emails for months then you don't get to make your emergency our emergency"
[22:12:10] andrewbogott: yes, but we are actively expecting end users to whine and not just maintainers
[22:12:38] true
[22:12:48] which means that seeing the $HOME is not guaranteed at all, nor is it guaranteed that these individuals ever knew the risks
[22:13:44] Ugh, I hate the "users are abandoned by their maintainers and so those users become my users" scenario
[22:13:56] ^ not suggesting any particular consequence, just whining
[22:14:08] * bd808 points to GLAM tech needs reports and whistles
[22:15:00] I have encountered a need for another refactor in my patch and also have a flu shot appointment, so it'll be a couple hours before I update my PR.
[22:15:14] I'm trying to run through these scenarios...
[22:15:18] I don't like it in that the WMF does not really recognize that babysitting a large number of abandoned tools is a job that we should pay people to do
[22:15:35] If a user wants the tool turned back on, that can /only/ be done by admins right? Because a random user can't access the tool anyway.
[22:15:44] correct
[22:15:55] Whereas if the admin wants the tool turned back on, then... probably they'll see the DISABLED file I'm dropping into their directory
[22:16:29] So does that mean the "who will watch IRC all holiday" question is somewhat unrelated to the "what is the mechanism for disabling/enabling tools" question?
[22:17:28] as long as we want 100% of stopped tools to have to talk with an admin I guess that's right
[22:17:43] I'm not personally sold on that, but I understand the argument
[22:18:35] I'm not totally following what the alternate proposal is but I will have to catch up on backscroll post-jab
[22:18:45] I think the risk of a large number of tools being noticed to be down by their maintainers and restarted without any sign is very low
[22:20:03] the whole point of the stoppage now is to cause that sort of thing to happen, but I think most of the noticing will be by end users. I also think they will mostly complain on village pumps if anywhere.
[22:20:46] whatever we end up with and everything else we talked about in the meeting earlier today needs to be communicated to cloud-admin@ I think, just so anyone not in that meeting (I guess mostly TNT these days?) will have the same info as we do
[22:21:37] +1 to this all needing an email to cloud-admin@ when we know what the runbook is
[23:39:34] bd808, you think the folks most likely to notice will be users and not maintainers?
If so I suspect we wouldn't see those complaints as much on IRC, rather in other on-wiki and off-wiki places
[23:40:55] I agree there's more discussion and nuance to hash out here, so we don't need to rush anything. A summary of actions and a runbook anyone on cloud-admin can execute are certainly precursors to taking action
[23:47:17] balloons: I could be very wrong, but yes I think if anyone notices that these tools are down it is more likely to be the users of the tools, at least for webservices. How many of those people end up directed to IRC is likely a function of where they first yell for help and if folks in those locations send them to IRC. WP:VPT could reasonably send them to IRC. The enwiki discord will probably ping t.aavi, TNT, or me directly.