[05:57:42] can't think of an easy way to do this with es7.10...
[05:58:07] wondering if a mix of copy_to & token_count could help...
[05:58:24] or an ingest pipeline
[05:58:37] but that seems silly to just index the size of an array :/
[05:59:01] this: https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime-indexed.html would have helped I guess but that's only available on es7
[05:59:05] s/es7/es8
[06:00:39] and opensearch is not there yet https://github.com/opensearch-project/OpenSearch/issues/1133
[07:42:26] inflatador: looks like the embedded sre meeting is gonna be during the first hour of the sub team breakout today (tuesday). That's 5-6am here so too early for me but might be doable in your tz
[07:43:52] actually scratch that, I was thinking eastern time; that'd still be like 7-8 am from texas
[08:23:16] Hi! Do we have a repo defining the flink production runtime environment? I'm interested in which dependencies of a flink application can be considered provided and which have to be packaged with it.
[08:32:31] pfischer: are you available for our 1:1?
[08:34:55] pfischer: about the question above, the best we have is probably https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/flink-rdf-streaming-updater/+/refs/heads/master (you'll have to learn about "blubber"). dcausse might have a better idea.
[08:36:33] pfischer: in general it's fine to package your deps in the job itself
[08:44:34] for flink deps it's generally mentioned in the flink docs; for everything else it can be bundled in the job jar, unless it's something like a log4j2 formatter that you'd like the flink runtime itself to use
[10:21:21] lunch
[12:13:08] ryankemper np, I'll give my regards
[13:16:22] brb
[13:53:35] back
[14:14:57] dr appointment, back in ~30
[14:31:13] back
[14:42:19] what is the relationship between WDQS and wikidata? can wikidata exist without WDQS?
[14:44:02] inflatador: technically yes, but there are so many editing workflows relying on it that the answer is probably no
[14:46:10] dcausse thanks, I'm looking thru this Blazegraph/Neptune page again, will call that out
[15:12:53] \o
[15:12:59] o/
[15:44:24] ebernhardson looks like the reload cookbook finished, but couldn't reenable puppet for trivial reasons. Should I start a reload now or do you want to take a look at the data first?
[15:47:27] inflatador: hmm, you mean start the data-transfer? I suppose lemme poke it a bit
[15:48:17] indeed, transfer
[15:50:21] inflatador: looks like the updater is still paused, likely because i manually did that earlier. Did the cookbook say it set the kafka timestamps?
[15:50:59] puppet is equally disabled for the same reason, i've just re-enabled puppet on wcqs2001
[15:53:27] ebernhardson it wouldn't re-enable because the disable/reenable message has to match (unless you force it, which is what we typically do ;) )
[15:53:34] as far as the kafka timestamps, no output on that
[15:53:50] hmm, Sep 20 15:51:56 wcqs2001 systemd[1]: Condition check resulted in Query Service Streaming Updater being skipped.
[15:54:05] systemd decided to not start the updater when run-puppet-agent tried to
[15:54:28] the data_reload flag is missing perhaps?
[15:54:41] maybe
[15:54:47] it should be touched at the end of the reload cookbook
[15:54:56] inflatador: did the cookbook die when reenabling puppet, or just complain it couldn't?
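For the array-size question from 05:58 above, the ingest-pipeline route on es7.10 could look roughly like the sketch below. All names are illustrative (a hypothetical `tags` array field on a hypothetical `my_index`), not anything we actually run.

    # Sketch only: ES 7.10 has no indexed runtime fields, so store the array's
    # length at write time with a script processor. Index/field names are made up.
    import requests

    ES = "http://localhost:9200"  # assumption: local cluster for illustration

    # Pipeline that writes len(tags) into an integer field on every indexing request.
    requests.put(
        f"{ES}/_ingest/pipeline/tags-count",
        json={
            "description": "index the size of the tags array",
            "processors": [
                {
                    "script": {
                        "lang": "painless",
                        "source": "ctx.tags_count = ctx.containsKey('tags') && ctx.tags != null ? ctx.tags.size() : 0",
                    }
                }
            ],
        },
    )

    # Attach it as the index's default pipeline so all writes go through it.
    requests.put(
        f"{ES}/my_index/_settings",
        json={"index": {"default_pipeline": "tags-count"}},
    )

Existing documents would still need an _update_by_query or a full reindex through the pipeline, which is presumably the silly part of indexing the size of an array this way.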
[15:55:14] so something might have failed when importing
[15:55:39] yea it must have, there is a 200M wcqs.jnl and no data_loaded
[15:56:01] munged is 36G which seems sane
[15:56:09] No errors on the cookbook I can see, except for the puppet enable
[15:56:11] 200M is too small, I think that's the size when blazegraph starts fresh
[15:56:32] we can manually run loadData.sh i suppose
[15:57:18] yes, I don't see why we could not, unless munge failed at some point
[15:57:45] hmm, actually munged only has 790 files. When i ran it before it was 1580?
[15:59:02] it's possible that reload-wcqs did make smaller chunks
[15:59:19] I'll paste a stacktrace, puppet and kafka are both mentioned in it
[15:59:22] is there an output of the cookbook somewhere?
[15:59:47] oh, yea the wcqs-data-reload script was passing `-c 50000` to the munge script, the cookbook uses default options
[15:59:51] https://phabricator.wikimedia.org/P34885
[16:00:56] inflatador: I don't see much in the stack, is there more in the logs?
[16:01:15] dcausse will check, I don't see anything in the tmux output besides that
[16:01:24] ahh, so it abandoned early
[16:01:31] inflatador: you should be able to write the scrollback to disk, sec
[16:02:12] inflatador: -b then :capture-pane -S -3000, then -b again and :save-buffer /path/to/file
[16:02:30] iirc cookbooks should write to a log somewhere
[16:02:45] yeah, they do log
[16:02:46] inflatador: logs should be in /var/log/spicerack/sre/wdqs
[16:03:16] with that error message though, the cookbook bailed after restarting blazegraph but before loading data
[16:03:42] oh ok
[16:04:07] looking at this cookbook i wonder if it's correct: it disables puppet, stops the services, then re-enables puppet and does the data load
[16:04:13] I wonder if stopping blazegraph from systemd now returns an error code
[16:04:17] i suspect puppet is supposed to stay paused so the updater doesn't start in the middle of the data load
[16:04:29] I'll xfer logs/dump the tmux buffer and get back to you shortly
[16:04:41] dcausse: no, it was because i manually disabled puppet and the end of the `with puppet.disabled(reason)` failed
[16:04:54] ah ok
[16:04:55] but i think the cookbook is also wrong there, it shouldn't have re-enabled puppet so early
[16:05:34] dinner time, back later
[16:06:04] meh: Phan\Exception\InvalidFQSENException: Invalid namespaced name for FQSEN '\1' in /mediawiki/extensions/GeoData/vendor/phan/phan/src/Phan/Language/FQSEN/FullyQualifiedGlobalStructuralElement.php:134
[16:06:45] cookbook logs are at wcqs2001.codfw.wmnet:/tmp/
[16:06:45] data-reload.logs.tar.gz
[16:06:49] inflatador: i'll make a pair of patches for the cookbook, few mnis
[16:06:51] mins
[16:07:08] ebernhardson sure, the joys of working blindly ;(
[16:07:26] if you need the tmux buffer LMK, otherwise I'm gonna step out for a bit
[16:08:26] inflatador: should be ok
[16:08:54] ebernhardson cool, working out, back in ~30-40
[16:08:57] i do wonder what the right way to flag a re-used munge is though... it should have some check that the munge is valid
[16:11:18] it happened in the past that the munger failed in the middle, so not sure how to do a sanity check without re-reading everything...
[16:12:06] if the flag is passed by the operator we could assume that it's correct
[16:12:08] can we assume if the last item in latest-mediainfo.ttl.bz2 is found in the last munge output file that it's probably valid? Still annoying because you can't tail a compressed file, but might work
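A rough sketch of that check, assuming the munged output is a directory of wikidump-*.ttl.gz chunks sitting next to latest-mediainfo.ttl.bz2 and that entity URIs survive munging unchanged; the paths, chunk naming, and regex are all guesses rather than what the cookbook actually produces:

    # Sketch: find the last entity URI in the (bz2) dump, then confirm the last
    # munged chunk mentions it. bz2 has no random access, so the whole dump is
    # streamed, which is the slow/annoying part.
    import bz2
    import glob
    import gzip
    import re

    DUMP = "/srv/query_service/latest-mediainfo.ttl.bz2"                       # assumed path
    CHUNKS = sorted(glob.glob("/srv/query_service/munged/wikidump-*.ttl.gz"))  # assumed naming

    entity_re = re.compile(r"<(https?://[^>]+/entity/[^>]+)>")
    last_entity = None
    with bz2.open(DUMP, "rt", errors="replace") as dump:
        for line in dump:
            m = entity_re.search(line)
            if m:
                last_entity = m.group(1)

    found = False
    if last_entity and CHUNKS:
        with gzip.open(CHUNKS[-1], "rt", errors="replace") as chunk:
            found = any(last_entity in line for line in chunk)

    print(f"last entity {last_entity!r} found in {CHUNKS[-1] if CHUNKS else None}: {found}")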
[16:12:26] yea, might be better; tailing the 30+GB file will still take 15+ minutes
[16:35:08] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/833422 and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/833423 should do what we need
[16:35:43] i'm kinda surprised we never noticed puppet gets turned back on before the dumps are loaded, i guess we don't reload very often but the streaming updater must sometimes be started by puppet before it's done?
[16:40:23] reloads are rare, we never used this cookbook with the streaming updater other than the initial load
[16:40:44] the data_reload flag should be what prevents the updater from running
[16:41:02] data_loaded I mean
[16:41:42] unless puppet uses something other than systemd to start it, but I doubt it
[16:41:45] ahh, ok that makes sense
[16:42:45] i guess we could only stop puppet in the initial bit then, although that feels a bit weird
[16:43:27] the only thing we left stopped was the updater
[16:45:51] yes... not sure why we don't want puppet up, perhaps we did not trust what could happen if something gets updated
[16:47:29] that's what it was doing: stopping puppet, stopping blazegraph and the updater, deleting the journal, starting blazegraph, restarting puppet. I just added a patch that would keep puppet disabled through the reload, but perhaps i should drop that patch
[16:48:16] it was surprising to me, but i suppose i wasn't thinking about the data_loaded flag keeping the updater down
[16:50:29] tbh I thought that puppet was disabled during the whole reload
[16:50:54] and not sure what's best here
[16:51:08] Guillaume might remember something
[16:51:18] it seems like if the only thing we are worried about is the updater, and the updater gates on the data_loaded flag, it should be safe to let puppet run
[16:53:59] hm... I vaguely remember SREs complaining that puppet can't stay disabled for so long, so it might be better to keep it running
[16:54:26] long = 8+ days in the wikidata case
[16:54:59] but we often had to finish the import manually anyways...
[16:56:19] yes, there is something about hosts being dropped from state somewhere if puppet doesn't check in regularly enough
[16:56:26] and it being more painful to bring it back
[17:20:24] back
[17:21:27] ebernhardson hosts will get erased from puppetdb if they don't check in after a while. Not too hard to re-register, but I'm not sure if it creates alert spam for other teams
[17:24:28] I'm gonna go ahead and take lunch. We can work on the cookbook during SRE pairing in an hr if y'all want
[17:24:37] kk
[18:14:36] back
[18:17:51] unclear what to do with deepcat, the problem is there are 303k categories within a depth of 5 from Category:People
[18:18:59] and that's a distinct count, honestly a bit surprising :P
[18:26:55] fun fact, per elastic it seems enwiki has 2.2M category pages. Might be excessive :P
[18:27:05] :/
[18:27:25] Deletionist! :P
[18:38:37] the best we could do is align the various timeouts and possibly give a better error message, but if deepcat works on other queries then this ticket should perhaps be deprioritized
[18:45:29] yea, i suppose the envoy timeout should be adjusted to match blazegraph; i ran the queries on a wdqs instance through nginx so that part is fine and not timing out
[19:47:27] heh, our wdqs timeout on nginx is 5 minutes, the envoy timeout is 10s, the SparqlClient timeout is 3s. Critically, SparqlClient sets the blazegraph timeout and the http timeout to the same value, which almost ensures http times out first... hmm
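If the timeouts do get aligned, one way to stagger them from the client side is sketched below, assuming blazegraph honours the X-BIGDATA-MAX-QUERY-MILLIS query-deadline header; the endpoint and the numbers are illustrative, not our real config:

    # Sketch: give blazegraph a deadline slightly below the HTTP timeout, so the
    # query engine times out (and reports it) before the HTTP layer gives up.
    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"  # illustrative endpoint
    HTTP_TIMEOUT_S = 10        # what envoy / the http client would allow
    QUERY_DEADLINE_MS = 9000   # keep blazegraph's own deadline below that

    def run_query(sparql: str) -> requests.Response:
        return requests.post(
            ENDPOINT,
            data={"query": sparql},
            headers={
                "Accept": "application/sparql-results+json",
                "X-BIGDATA-MAX-QUERY-MILLIS": str(QUERY_DEADLINE_MS),
            },
            timeout=HTTP_TIMEOUT_S,
        )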
[19:48:37] sparql timeouts don't return json anyway, even though it was requested. I suppose we can at least look at the headers, see text/plain instead of application/sparql-results+json, and know we should enter error handling
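A minimal sketch of that content-type check, assuming a requests-style response object (the function name and error handling are illustrative):

    # Sketch: blazegraph timeouts come back as text/plain even when JSON was
    # requested, so branch on Content-Type before trying to parse the body.
    import requests

    def parse_sparql_response(resp: requests.Response) -> dict:
        content_type = resp.headers.get("Content-Type", "")
        if "application/sparql-results+json" not in content_type:
            # likely a query timeout or other engine-side error
            raise RuntimeError(f"sparql error ({content_type}): {resp.text[:200]}")
        return resp.json()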