[11:22:54] lunch
[14:04:40] o/
[15:59:48] \o
[16:02:55] hmm, kinda odd. Max completed time from 24k tasks is 11min, but there are 11 more tasks that have been running 4+ hours. On one random task i checked, the last log message is from 22:31 (5.5 hours ago)
[16:04:07] last message is 'Started 1 remote fetches in 1 ms', spot checking a few of the other instances they say the same thing :S
[16:06:25] o/
[16:07:40] if this is the cirrus import, checking logs the last 3 runs were pretty consistent and finished within 6 hours
[16:08:24] yea this is the cirrus import. And indeed that's what i was expecting, this is odd :S It's at ~17h but most of the tasks finished 10h ago
[16:08:33] :/
[16:17:40] it's not clear i can do anything with this either...there is no particular way to kill a single executor or fail an individual task execution that i can see, only to kill the whole thing
[16:17:56] i guess if i could log into the workers i could kill the relevant python executable, but can't :P
[16:24:22] dcausse: I just got pinged about the flink-k8s stuff on Slack. I sent you a Slack huddle invite, we're currently trying to deploy on the dse cluster
[16:24:39] inflatador: nice
[16:25:07] hm... huddle requires chrome or the desktop app :(
[16:27:16] sorry. I'm not much for Slack myself. Looks like they have a Slack Linux app, but you need that stupid snap app store thing that Ubuntu uses: https://slack.com/downloads/linux
[17:01:51] well, no clue what was wrong...but after poking around enough i found `yarn container -signal GRACEFUL_SHUTDOWN`, which allows killing individual containers. yarn then started new containers for the stuck tasks and they finished in a couple minutes
[17:10:01] workout, back in ~40
[17:16:36] hmm, for now i think i'm going to ignore this and hope it's a fluke, but i think we could enable speculative execution with a high multiplier and quantile to have spark try to run the same task a second time if almost everything is complete and there are a few very long-running stragglers (config sketch at the end of this log)
[17:49:54] back
[18:32:52] lunch, back in ~40
[18:56:51] meh, indices are still not being created for new wikis :S
[18:58:13] sigh...
[18:59:36] I wonder if we should set up an alert for those, or a quick maint script or the like, run daily (sketch at the end of this log)
[19:00:08] i'm more in the boo-hiss slack crowd too, but huddle is way nicer for meetings and screen sharing than google meet.
[19:00:12] you can draw on my screen!
[19:00:19] and the chat conversation is saved after
[19:01:09] finally installed the desktop app (took ages to download), will try these features next time I have an opportunity :)
[19:01:13] I'm not sure for the new wikis...it does seem like we need something automated that does the necessary steps. Right now trying to convince logstash to tell me if the maintenance script was ever run (not sure we have default logs that would be generated though)
[19:02:04] that'd be great to fix the addWiki script
[19:02:38] i wonder if there is some way...the problem is addWiki isn't run in the context of the new wiki. It's run against a fake wiki in the appropriate database shard
[19:03:05] so it doesn't get the right cirrus sharding
[19:03:29] so we rely on the operator to run the cirrus scripts?
[19:03:35] i guess we could override the cirrus sharding from addWiki, but it feels hacky
[19:03:50] dcausse: yes, although last time i checked with someone who created a wiki they were pretty sure they ran it
[19:04:01] (but no index was created)
[19:04:08] and they didn't get any errors from the script :S
[19:04:38] yes it's too "rare" to have clear reproducible steps :(
[19:05:31] still poking logs, but all i'm finding are ElasticaWrite failures and nothing about index creation (yet)
[19:22:30] Silliest of solutions for now ... there is another wiki in the queue to be created so i asked them to record the output and put it on phabricator :P
[19:23:51] I can try to be around if this is done in EU mornings
[19:27:43] not sure, looks like the most recent ones were done around 13:00 UTC so probably done in EU time
[19:32:18] I should be around, I'll add gucwiki to my irc highlights
[19:40:35] dinner
[20:51:34] hmm, data reload on wdqs2009 failed again. I wonder if munging from NFS could be a problem. Like, we're using "latest", which is a symlink...could that cause a problem when a new dump is added and the symlink changes?
[20:55:21] probably not, but I'm gonna try it again with hard-coded paths just in case. If it fails a third time, I guess we wait for 1009 to (hopefully) finish and use it for data transfers
[21:21:23] also should have reused the munge... ;(
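
For reference, a minimal sketch of the speculative-execution idea mentioned at 17:16, written as a PySpark session config since the log mentions python executables on the workers. `spark.speculation`, `spark.speculation.quantile` and `spark.speculation.multiplier` are standard Spark properties, but the app name and the exact threshold values are illustrative assumptions, not what the cirrus import actually uses.

```python
# Sketch only: relaunch straggler tasks speculatively instead of killing containers by hand.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cirrus-import")                 # placeholder app name
    # Allow Spark to launch a second copy of a slow task...
    .config("spark.speculation", "true")
    # ...but only once 95% of the tasks in the stage have finished...
    .config("spark.speculation.quantile", "0.95")
    # ...and the task has been running 4x longer than the stage median.
    .config("spark.speculation.multiplier", "4")
    .getOrCreate()
)
```

With a high quantile and multiplier like this, speculation stays dormant for normal runs and only kicks in for the "everything done except a few stuck tasks" case described above.
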
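And a rough sketch of the daily check floated at 18:59 for wikis whose indices never got created. Everything concrete here is an assumption for illustration: the cluster endpoint, the wiki list, and the `<dbname>_content` / `<dbname>_general` alias naming. It simply HEADs each expected alias (Elasticsearch answers 200 if it exists, 404 if not) and exits non-zero if anything is missing, which is enough to hang an alert off.

```python
# Hypothetical daily check: flag wikis whose search indices were never created.
import sys
import requests

SEARCH_HOST = "https://search.example.org:9243"   # placeholder cluster endpoint
WIKIS = ["gucwiki"]                               # wikis to verify

def missing_aliases(wiki: str) -> list[str]:
    """Return the expected aliases that the cluster does not know about."""
    missing = []
    for suffix in ("content", "general"):
        alias = f"{wiki}_{suffix}"
        # HEAD /<name> returns 200 if the index/alias exists, 404 otherwise.
        resp = requests.head(f"{SEARCH_HOST}/{alias}", timeout=10)
        if resp.status_code == 404:
            missing.append(alias)
    return missing

if __name__ == "__main__":
    problems = {w: m for w in WIKIS if (m := missing_aliases(w))}
    for wiki, aliases in problems.items():
        print(f"{wiki}: missing {', '.join(aliases)}")
    sys.exit(1 if problems else 0)
```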