[09:54:04] dcausse: i'll be like 20 mins late for the meeting
[09:54:18] I have to rush somewhere
[09:54:28] ejoseph: our meeting is at 11:30
[09:57:50] Oh
[09:57:59] I keep forgetting Tuesdays
[10:00:47] :)
[10:41:03] lunch + errand
[11:54:09] lunch
[12:37:50] dcausse: for which JDKs and distros should I build/upload jvmquake? initially JDK11 for buster, something else?
[13:04:44] Greetings
[13:11:34] moritzm can you do Stretch, Buster, and Bullseye if it's not too much trouble?
[13:12:04] sure thing, all for 8 or 11?
[13:22:21] 8 for buster and stretch, 11 for bullseye
[13:30:13] dcausse I noticed a lot of GC alerts for cloudelastic, do I need to mitigate as described at https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell ?
[13:53:59] inflatador: no, because they're due to a reindex I think
[13:57:29] ah OK, will hold off then
[13:57:34] we might want to tune this alert I guess
[14:58:31] \o
[14:59:06] if jvmquake works, hopefully that can replace the GC hell alerting. Unfortunately the current values there don't manage to alert when eqiad/codfw hosts get into trouble, but do alert when cloudelastic is performing nominally
[15:02:41] dcausse, ebernhardson: did you receive 2 tasks to grade for our SrSWE position?
[15:04:28] gehel: yes, i looked over them but wasn't sure how we wanted to grade them
[15:05:07] We had talked previously about not really grading them, or rejecting anyone based on the submitted code, but rather using them as something to discuss in later interviews?
[15:07:41] gehel: yes, and I graded the two
[15:08:33] ebernhardson: yep, the grading is mostly about creating questions for the next step and only rejecting if the submissions are completely insane
[15:09:14] I think we can look over the notes when the next round of interviews is scheduled
[15:09:22] but for now I can't seem to retrieve them
[15:40:48] i guess we could re-tune the old gc alerting to be different for eqiad/codfw vs cloudelastic
[15:42:11] * ebernhardson also has to learn how to get code merged in gitlab...i guess this is the github model where we are expected to manually name branches and maintain forks?
[15:42:55] where would i find or add something to someone's gitlab review queue, do we have review queues?
[15:44:22] I think you can assign reviewers once you've created the MR
[15:45:49] I think you can push directly to a new branch of the project without forking
[15:45:56] where would i see the things that have been assigned to me (currently should be empty)?
[15:46:23] https://gitlab.wikimedia.org/dashboard/todos ?
[15:46:38] ahha, yea that looks like it
[15:47:10] ahh, there is also an icon in the top left
[15:47:33] err, top right
[16:00:28] dcausse: imported the builds. I found two more bugs I fixed locally, I'll send a patch in the next few days (it will only really matter when we move to a new jvmquake release)
[16:00:33] or Java 17 :-)
[16:00:47] moritzm: thanks! :)
[16:02:41] Thanks m-moritz, just saw the reprepro emails
[16:08:17] workout, back in ~30-40
[17:08:17] * ebernhardson ponders the insanity of having the rolling restart scripts poke the _tasks api and avoid restarting nodes until they are done, but it would be tedious
[17:08:41] err, i suppose that's not clear: until they are done reindexing, because if you restart a node that's hosting a reindexing task, the reindex will fail and need to be restarted
[17:09:51] there must be coordination I think; a solution is perhaps having the reindex operations run by cumin so that it's all centralized
[17:10:31] yea, it would be much simpler if there was a central flag saying don't restart anything in the cluster
[17:11:46] and back
[17:12:24] Wouldn't mind adding that logic to the cookbooks...not sure how long that would delay an individual host restart, but it sounds doable in theory
[17:12:38] separately, it makes me wonder about the regularly failing dumps. I had thought they might be related to restarts, but they fail too often for that. separate investigation needed :S
[17:13:37] yes, I suspect the scroll api is too fragile, but it could be other things
[17:14:56] sigh, I thought we had merged this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/526621 but it looks like I forgot to move it forward
[17:15:08] that would help to get rid of the server-side scroll
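(For context on the scroll discussion above: one common way to drop a server-side scroll is cursor-style paging with search_after, which keeps no state on the server, so a node restart mid-dump fails a single request instead of invalidating a scroll context. A minimal sketch follows; the host, index name, and sort field are invented placeholders, not what CirrusSearch actually uses.)

```python
# Minimal search_after paging sketch. Assumes Elasticsearch 6+, the
# `requests` library, and a unique sort field; all names are placeholders.
import requests

ES = "http://localhost:9200"  # placeholder host
INDEX = "somewiki_content"    # placeholder index

def dump_all(page_size=1000):
    # Sort on something unique so the cursor is stable across pages.
    body = {
        "size": page_size,
        "sort": [{"page_id": "asc"}],  # assumed unique per document
        "query": {"match_all": {}},
    }
    while True:
        hits = requests.get(
            f"{ES}/{INDEX}/_search", json=body
        ).json()["hits"]["hits"]
        if not hits:
            return
        yield from (hit["_source"] for hit in hits)
        # Resume strictly after the last hit of this page.
        body["search_after"] = hits[-1]["sort"]
```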
[17:17:38] inflatador: i don't know that it would be too hard, but i wonder if it's overly complex. Mostly you fetch elastic:9200/_tasks and that returns a list of nodes, each node contains a list of tasks, and tasks with action indices:data/write/reindex indicate something that shouldn't be interrupted
[17:18:11] maybe, i would have to poke it more to verify those are the correct tasks
[17:18:26] Something I need to learn more about regardless ;)
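(Sketched concretely, the cookbook check described above could look something like this. The host is a placeholder, and, per the caveat at 17:18:11, whether indices:data/write/reindex is the complete set of task actions to wait on still needs verifying.)

```python
# Rough sketch of the restart-gating check: poll /_tasks and hold off while
# any reindex task is running anywhere in the cluster.
import time
import requests

def wait_for_reindex_tasks(es="http://localhost:9200", poll_secs=60):
    while True:
        # Ask only for reindex tasks; the response maps node id -> node info,
        # and each node carries a dict of its matching tasks.
        resp = requests.get(
            f"{es}/_tasks",
            params={"actions": "indices:data/write/reindex"},
        ).json()
        tasks = [
            task_id
            for node in resp.get("nodes", {}).values()
            for task_id in node.get("tasks", {})
        ]
        if not tasks:
            return  # safe to proceed with the rolling restart
        time.sleep(poll_secs)
```

(Polling keeps the added per-host delay bounded by the reindex itself, which is the tradeoff raised at 17:12:24.)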
[17:20:11] dcausse: hmm, indeed. I'll check that patch again but it seems it wasn't far from merging
[17:23:16] relatedly, https://github.com/elastic/elasticsearch/issues/42612 did not get much love :(
[17:29:31] dcausse: at least some progress was made, but indeed they don't seem to have made it far down that checklist :(
[17:30:42] looking closer, they already added some new fields like seq_no; I hadn't heard about them
[17:36:56] Hello. Does anyone here make use of the hadoop worker nodes that have the GPUs for anything? I have to schedule a reboot of them, so I'm trying to find out who might be inconvenienced by it and how I can minimize that. Thanks.
[17:39:22] btullis: ebernhardson is probably the sole person in this channel possibly playing with GPUs, but we don't have any production jobs as far as I can tell
[17:41:30] btullis: yup, nothing of ours uses GPUs
[17:41:42] Great. Many thanks both.
[17:51:31] lunch, back in ~30
[18:18:48] dinner
[18:23:54] back
[19:30:51] gehel: inflatador: will be 4 mins late to pairing
[19:31:01] ack
[21:05:17] inflatador: ebernhardson: see following snippet
[21:05:20] https://www.irccloud.com/pastebin/90T94yQB/
[22:07:04] sneaky error messages...it errored on mar 3, 6, 10, and 13. The code to log more info was deployed on the 14th and it hasn't fired since :P
[22:07:28] haha
[22:08:20] (╯°□°)╯︵ ┻━┻
[22:16:55] * ebernhardson suspects it's because the error now goes into a 100M+ file of daily output instead of logstash :S
[22:23:36] Okay, I fixed the codfw cluster settings as well, so all the cirrus setting check alerts are fixed now
[22:25:22] (And moved https://phabricator.wikimedia.org/T301511 to needs reporting)
[22:25:32] ebernhardson :S why not in logstash anymore?
[22:28:25] ryankemper: i wasn't thinking about it at the time, but i switched the errors from an `undefined index`, which is a programming error and reported via logstash, to use the maintenance script error reporting, which i forgot only goes to stdout
[22:28:45] next step is to duplicate errors generically for maintenance scripts into logstash
[22:28:54] ah, makes sense
[22:28:55] and nice
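(The real change would land in MediaWiki's PHP maintenance classes, but the general pattern of duplicating one error call into both stdout and a structured channel is easy to sketch. A rough Python illustration, with every path and name invented for the example:)

```python
# Illustration only: route one error call both to stdout (where maintenance
# output already goes) and to a structured handler that a logstash-style
# pipeline could pick up. All paths and names here are made up.
import logging
import sys

log = logging.getLogger("maintenance")
log.setLevel(logging.ERROR)

# Destination 1: stdout, i.e. the current behavior (the big daily output file).
log.addHandler(logging.StreamHandler(sys.stdout))

# Destination 2: JSON-ish lines a log shipper could forward to logstash.
structured = logging.FileHandler("maintenance-errors.log")  # placeholder path
structured.setFormatter(logging.Formatter(
    '{"channel": "maintenance", "level": "%(levelname)s", "msg": "%(message)s"}'
))
log.addHandler(structured)

log.error("example maintenance failure")  # lands in both places
```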