[08:14:37] urbanecm: in case you wanted to troubleshoot something that happened yesterday, note that we had an incident & a huge lag for the whole day (https://grafana-rw.wikimedia.org/d/8xDerelVz/search-update-lag-slo?orgId=1) so anything was heavily delayed
[08:17:16] hm.. it took 2 hours less to recover cloudelastic than production-search@eqiad
[08:39:12] we'll have to move swift_upload from refinery to discolytics, seems like we're the only team still using this and it's part of the oozie folder that's getting deleted
[08:39:23] context: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1092879
[13:32:32] dcausse: thanks for the info, good to know. fortunately, that doesn't impact me. i'm investigating an Add Link suggestion a user completed, but... that suggestion wasn't supposed to be generated. so i'm trying to determine whether the error happened while generating suggestions, or at the editor layer.
[13:32:44] unfortunately, Growth's tooling deletes the suggestion data once it is completed, which makes sense, but it makes debugging trickier in cases like this
[13:34:18] urbanecm: sure, we could track that page_id in our event streams but only if the problem occurred after we switched to eventbus; if it occurred before then I'm afraid we might not have this data
[13:35:11] dcausse: that's quite possible unfortunately. i guess growth could start logging what we are dropping to logstash
[13:42:32] kubectl logs does not distinguish between stdout/stderr, might be a bit annoying for extensions/CirrusSearch/maintenance/ExpectedIndices.php where we parse its output...
[13:42:56] going to assume that stderr is empty :/
[14:06:27] \o
[14:06:59] o/
[14:14:59] o/
[14:19:08] dcausse do you have any more info on that incident? I'm afraid to say I completely missed it ;(
[14:23:50] inflatador: haven't documented that but one of the search update pipeline jobs failed (alerts CirrusSearchUpdaterKafkaMessagesInTooLow & CirrusStreamingUpdaterFlinkJobUnstable) with an OOM
[14:24:57] it was down for several hours, I gave it a bit more mem and it restarted but took quite some time to absorb its backlog: (https://grafana-rw.wikimedia.org/d/8xDerelVz/search-update-lag-slo)
[14:27:44] Damn. I just scrolled back in #data-platform-alerts and we got them about 1 AM my time on Tuesday, sorry I did not investigate earlier
[14:29:15] no worries
[14:42:52] we also got a bunch of alerts for WDQS in eqiad that cleared on their own. Hope we don't have a bad query situation again ;)
[14:43:38] yes... I just ignored those hoping that it would not last long, caused some trouble for about 30m
[18:15:02] dinner
[18:22:21] ryankemper I re-opened https://phabricator.wikimedia.org/T303011 to hopefully automate the deb pkg build process
[18:54:20] hmm, mjolnir with the updated config failed but in a different way. This time the ApplicationMaster timed out... always fun guessing at what to change next :P
[18:56:15] curiously, from the output it seems like it finishes the feature selection, maybe it's getting hung up building the reduced dataset or something
[19:07:19] lunch, back in ~40
[20:12:27] inflatador test test
[20:12:42] elasticsearch
[20:13:03] inflаtаdor
[20:14:09] ryankemper I'm gonna ping you with a homoglyph, let me know if you get the ping or not (I'm hoping not)
[20:14:14] ryаnkemper
[20:14:37] inflatador: no ping
[20:14:57] Cool idea
[20:15:03] w00t!
This might be an overly-clever way of avoiding pings
[20:15:34] Altho I guess it’s actually a benefit of the i.nflatador way of pinging that it’s transparent that no ping is desired
[20:17:00] ah good point. This is probably more like a novelty
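For context on the trick being tested above: the "no ping" nick swaps the Latin letter "a" for the visually identical Cyrillic "а" (U+0430), so a client that highlights on an exact nick match never fires. A minimal Python sketch (just an illustration using the standard-library unicodedata module, not part of any Wikimedia tooling) that exposes the substitution:

```python
import unicodedata

latin = "inflatador"        # plain ASCII nick
lookalike = "inflаtаdor"    # the "а"s here are Cyrillic U+0430

# The strings render identically but compare unequal, which is why a
# client matching on the literal nick string never triggers a highlight.
print(latin == lookalike)   # False

# Show which code points differ and what they really are.
for a, b in zip(latin, lookalike):
    if a != b:
        print(f"{a!r} ({unicodedata.name(a)}) vs {b!r} ({unicodedata.name(b)})")
        # 'a' (LATIN SMALL LETTER A) vs 'а' (CYRILLIC SMALL LETTER A)

# NFKC normalization does not fold Cyrillic а back to Latin a, so naive
# normalization alone won't detect this kind of homoglyph.
print(unicodedata.normalize("NFKC", lookalike) == latin)  # False
```

Catching spoofed nicks like this generally needs a confusables/skeleton comparison along the lines of Unicode TR #39, which is more than a plain string match does, hence no ping.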