[09:57:13] lunch
[10:12:51] first quick dashboard to illustrate the prometheus metrics exported from flink https://grafana-rw.wikimedia.org/d/gCFgfpG7k/flink-session-cluster
[10:12:56] lunch too
[10:16:58] relocating
[11:23:05] dcausse: I'd love to have a streaming updater plan walkthrough, preferably with ryankemper, next week - it looks clear, but I want to make sure I understand everything. Does that make sense?
[11:25:00] btw, the level of detail in it is great :)
[12:05:51] zpapierski: I'm out next week so we should do this today or tomorrow
[12:06:16] ah, pity, ryankemper won't be joining us this week, but ok
[12:07:32] I'm going to lunch in a sec, we can do that at some point during the open hangout
[12:08:58] break
[12:37:56] rebuilding the image with the updated flink plugin
[13:11:45] hm.. maven does not seem to increment versions like 1.2.3-wmf1 the way I wanted...
[13:16:18] zpapierski: when you are around, can you join https://meet.google.com/ugw-nsih-qyw ?
[13:31:18] relocating
[14:19:28] dcausse: if you want company for the interviews tomorrow, I can jump in...
[14:19:41] gehel: sure, I can invite you
[15:30:17] zpapierski: https://gerrit.wikimedia.org/r/c/wikidata/query/flink-rdf-streaming-updater/+/710278 I think there was a packaging issue too that might have caused the problem (the original swift client jar was copied under lib)
[15:43:07] maybe, but from what I read, since each jar has its own class loader, it wouldn't have worked anyway
[15:51:35] * ebernhardson realizes that mjolnir hasn't finished norm_query_clustering in a few weeks, but the tasks don't actually die so it doesn't fail
[15:51:48] one has been running a single instance for 279 hours :P
[15:52:28] ouch
[15:52:54] probably some kafka wonkery, we are doing a crazy hack there...
[15:53:20] oh indeed, that reminds me of painful debugging sessions :(
[15:54:22] was talking with Joal last time and he mentioned that they might perhaps relax the wall between analytics and prod
[15:55:40] hmm, interesting but seems dangerous :)
[15:57:48] it's what they do with aqs iirc, they push directly from hadoop to cassandra
[15:58:05] that seems a lot simpler than what we have :)
[15:58:11] it's what we used to do, but we were asked not to :)
[16:00:20] yes... no clue if this is going to change anytime soon anyway
[16:08:45] works better: Completed checkpoint 1 for job 1566c08b8f5aedc3cb8de2a400b9ed94 (73749306 bytes in 13430 ms)
[16:09:15] but now: https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs&all_settings=true failed. BasicHttpResult(failure) encountered local exception: Unsupported or unrecognized SSL message
[16:11:24] hmm, I dunno about this exact thing, but I've seen that elsewhere when an https client connects to an http endpoint
[16:12:04] don't know how though, it's using a normal url without ports :S
[16:18:37] yes, that's exactly it, connecting with https to a plain http service
[16:24:50] works well now
[16:46:39] for mjolnir it should have been more obvious... but the daemons think eqiad is busy. Looking into why the grafana dashboard doesn't report the value, I had to query the prom exporter on the hosts
[16:49:44] meh, 1 of the 8 instances on each host is reporting 0 instead of the proper value... I guess since it's not really "broken" I'll just leave it to fix itself when traffic moves
[16:50:48] because of morelike?
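(Side note on the 16:09 error above: "Unsupported or unrecognized SSL message" is the JDK TLS layer complaining that the bytes it received are not a TLS handshake, which is exactly what happens when an https:// URL points at a plain-HTTP service, as confirmed at 16:18. Below is a minimal sketch of that failure mode, assuming a hypothetical plain-HTTP server on localhost:8080; the URL path and class name are illustrative and not taken from the streaming updater code.)

```java
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLException;

public class TlsToPlainHttpDemo {
    public static void main(String[] args) throws Exception {
        // Assumption: a plain-HTTP service is listening on localhost:8080.
        // Forcing an https:// scheme against it makes the TLS handshake fail,
        // because the server's plain-text reply cannot be parsed as a TLS record.
        URL url = new URL("https://localhost:8080/w/api.php?action=streamconfigs");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        try {
            conn.getResponseCode(); // forces the connection and the TLS handshake
        } catch (SSLException e) {
            // Typically surfaces as "Unsupported or unrecognized SSL message"
            System.out.println("handshake failed: " + e.getMessage());
        }
    }
}
```

(The fix discussed at 16:18 and confirmed working at 16:24 was simply to match the URL scheme to what the endpoint actually speaks.)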
[16:52:16] yea, this is an estimate of the delta in query_total for enwiki_content
[16:52:30] the estimate is ~830 on eqiad and ~4400 on codfw, I suppose that's shard queries per second
[16:53:16] ah, I thought it watched only "fulltext", but morelike queries are probably reported as fulltext too :/
[16:53:50] yea, for stats groups we send the full set of syntax used
[16:54:23] and I completely failed to ship the patch today, sorry... I had scheduled it but went away, 1pm is a bad time for me :(
[16:54:32] and missed the window
[16:54:41] it's ok, it's been down a few weeks; a day or a weekend isn't going to be much different :)
[16:54:58] true :)
[16:55:08] we mostly run the model weekly to know when it breaks and to identify issues, so it will work when we need it to. So, success? :)
[16:57:51] :)
[17:18:19] dinner
[17:46:58] airflow tasks are complaining since I restarted the scheduler to deploy a plugin update; they are all wait_for_* and don't mean anything really
[17:47:09] (as in, you can ignore the emails to -alerts :)
[19:34:32] dcausse gehel hi, we just dropped 100 req/sec from wdqs, we will drop twice more soon. Context: T176312
[19:34:32] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312
[19:35:22] cool! I remember being surprised looking at wdqs query logs the first time and finding it was all regex queries :)
[19:36:14] Amir1: thanks !
[19:36:40] Or no thanks: you just made me remember this "feature" :/
[19:37:21] If it makes a noticeable impact on wdqs, and especially on its reliability, let us know!
[19:37:41] WDQS is a "regex as a service"
[19:37:52] it's now 42% on k8s shellbox but I'll ramp it up to 100% on Monday
[19:38:00] gehel: more like sandbox as a service :D
[19:38:29] Amir1: you are reducing the load, right? I don't see any way that this would negatively affect WDQS (famous last words)
[19:38:51] the shellbox in k8s is much faster because it uses RPC; the checkconstraint job is already nosediving https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=3&orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-job=constraintsRunCheck&from=now-6h&to=now
[19:39:17] gehel: hopefully, but I mean more like CPU usage, etc., if those go down
[19:39:28] Imagining that Blazegraph provides any kind of sandboxing is "interesting"
[19:39:31] and we can see that wdqs is healthier
[19:39:44] Yep, that makes sense
[19:40:12] cc: ryankemper ^
[19:41:14] It's all traffic on the internal WDQS cluster (or should be at least). This one is usually ok in terms of load
[19:41:42] let me check if it really goes to the internal cluster
[19:42:01] maybe we can repurpose or reduce capacity of the internal cluster
[20:03:54] As far as raw # of hosts, we have it balanced to where we can't go any lower in terms of # of internal hosts, so the only way we can take advantage of the reduced load is if there's some type of non-public query/job that runs on the public cluster and could be shifted to the internal one
[20:04:02] IIUC; if not, disregard :P
[21:20:53] Hi all. Yesterday in the office hours, someone (and I apologize for forgetting who) was saying that Redis may not be necessary for Elasticsearch. Something about the database table row size. Does anyone know if that is still indeed an issue or if it was resolved?
[22:58:59] justinl: ebernhardson would be the one you would want to talk to about Redis, I think.
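(Side note on the 16:52 exchange above: query_total is a monotonically increasing counter, so the per-second figures (~830 for eqiad, ~4400 for codfw) come from taking the difference between two scrapes and dividing by the scrape interval. A rough sketch of that arithmetic follows; the class, helper name, and sample numbers are illustrative, not the actual dashboard query.)

```java
public class QueryRateEstimate {
    /** Per-second rate derived from two samples of a monotonically increasing counter. */
    static double perSecond(long earlierTotal, long laterTotal, long intervalSeconds) {
        return (double) (laterTotal - earlierTotal) / intervalSeconds;
    }

    public static void main(String[] args) {
        // Two hypothetical scrapes of query_total for enwiki_content, 60 seconds apart.
        System.out.println(perSecond(1_000_000L, 1_000_000L + 830L * 60, 60));   // ~830 qps, eqiad-like
        System.out.println(perSecond(5_000_000L, 5_000_000L + 4_400L * 60, 60)); // ~4400 qps, codfw-like
    }
}
```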