[01:12:40] * ebernhardson somehow didn't realize there are 11B links between wiki pages, until actually counting to write comments to justify the spark configuration
[01:13:25] 11,673,028,280 per the latest cirrus dump
[09:40:46] ryankemper / inflatador: I'll skip our pairing session today. Mostly non-stop meetings from 3:30pm to 8pm, I'm going to be useless after that.
[11:05:51] lunch
[11:12:28] lunch 2
[11:34:42] lunch
[14:33:18] gehel ACK
[15:55:39] \o
[16:00:42] for the puppet deploy window today, could we ship https://gerrit.wikimedia.org/r/c/operations/puppet/+/855673 ? Requires a rolling restart of the small clusters (usually painless)
[16:23:19] o/
[17:11:51] errand & dinner
[17:24:30] hmm, the orderby.sparql query in the rdf test queries fails due to timeout. I wonder if we can use a more restricted query? No clue what it's testing though
[17:32:20] it seems to want to sort over all authors; issuing a matching count query from the blazegraph workbench takes 1m50s and reports it's sorting over 27.5M rows
[17:32:33] (elasticsearch sorts 27.5M rows much faster :P)
[17:52:41] ouch, i started the actual order by query and did something else for a bit, 15m later it's still running :S
[17:52:59] is that more of a regression? I have to imagine this worked at some point
[18:57:00] killed the query after ~50m, seems it shouldn't be in the test set. finished the deploy, incubator seems to be properly whitelisted now
[19:30:52] o/ we interviewed a candidate for an SRE position today that had worked on this: https://github.com/criteo/garmadon looks cool for monitoring hadoop, but also for spark and flink apps running in yarn
[19:36:31] never heard of it, but +1 to its opening line: As someone who has already used Hadoop knows, it is very hard to get information from what has been running on the cluster, despite logs and framework-specific UIs (the Spark driver's UI for instance). It gets even harder to get that information when an application failed in an unexpected manner.
[19:36:59] reading nodemanager logs is the opposite of my idea of fun :S
[19:43:16] ebernhardson: I think I never ran these queries, are there indications that we have to run them on deploys?
[19:43:34] dcausse: the deploy docs :) https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service#The_actual_code_deploy
[19:43:58] oops :)
[19:44:00] supposed to run them pre-deploy, then again against the canary before continuing
[19:47:14] not sure I understand how this works :/
[19:47:22] I don't see them in the deploy repo
[19:47:28] they are in the rdf repo
[19:48:11] there is a shell script that sends queries via curl to a local port you should port-forward into the canary. It has foo.sparql files; if foo.result exists then the response should match it, if there is no foo.result then the query just has to not fail
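A rough sketch of that flow in Python, for illustration only: the real runner is a shell script in the rdf repo driving curl, and the directory name, port, and endpoint path below are assumptions rather than its actual values.

    #!/usr/bin/env python3
    # Illustrative stand-in for the test-query runner described above: for each
    # *.sparql file, POST it to a SPARQL endpoint (assumed to be port-forwarded
    # from the canary to localhost) and, when a matching *.result file exists,
    # compare the response against it; otherwise the query only has to not fail.
    import glob
    import os
    import sys
    import urllib.parse
    import urllib.request

    # Assumed port-forward target; the real script's endpoint may differ.
    ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

    def run_query(sparql_text):
        data = urllib.parse.urlencode({"query": sparql_text}).encode()
        req = urllib.request.Request(
            ENDPOINT, data=data,
            headers={"Accept": "application/sparql-results+json"})
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.read().decode()

    failures = 0
    for query_file in sorted(glob.glob("queries/*.sparql")):  # hypothetical directory
        with open(query_file) as f:
            sparql = f.read()
        try:
            body = run_query(sparql)
        except Exception as exc:
            print(f"FAIL {query_file}: {exc}")
            failures += 1
            continue
        expected_file = query_file[:-len(".sparql")] + ".result"
        if os.path.exists(expected_file):
            with open(expected_file) as f:
                expected = f.read()
            if body.strip() != expected.strip():
                print(f"FAIL {query_file}: response does not match {expected_file}")
                failures += 1
                continue
        print(f"OK   {query_file}")

    sys.exit(1 if failures else 0)

With the canary's SPARQL port forwarded to localhost, running something like this before the deploy and again against the canary is the check the deploy docs are asking for.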
[19:49:19] seems stas added the orderby in 2016 and it's been the same since, for T130318
[19:49:19] T130318: Query tests for WDQS deployment - https://phabricator.wikimedia.org/T130318
[19:49:29] ok, I've definitely never done this
[19:49:32] i suspect we can simply delete the orderby, or maybe limit the query more so it's sorting 100k items
[19:49:38] or remove it from the deploy docs :)
[19:49:39] Ryan might have run into similar issues
[19:50:08] +1 to removing test queries that might kill a server
[19:50:36] it times out by default; i was curious so i ran it directly in the workbench, which skips the timeouts we add in nginx
[19:51:19] but sure, might as well delete it, it's not doing what it was intended to
[19:51:58] dcausse: ebernhardson: yeah, in the test suite the orderby query has been broken since I got here, always fails
[20:06:56] back... sorry, forgot to say I was going to my son's recital
[21:15:21] deployed a +10% increase to the CirrusSearch pool counters, which means we allow slightly more total parallel queries from mw->elastic. Hoping that reduces the pool counter rejections we've started to see since the 7.10 update. If it's too much I would expect to see increased levels of thread pool queueing in the elasticsearch graphs
[21:16:04] alternatively, we could have bots that are pushing the search APIs until they get rejections, in which case it wouldn't matter how much more we allow *shrug*
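To make the pool counter change concrete, here is a minimal sketch of the semantics being tuned (a cap on parallel queries plus a bounded wait queue, with rejections past that). This is not MediaWiki's actual PoolCounter implementation, and the numbers are invented.

    import threading

    # Conceptual sketch only: at most `workers` queries execute in parallel, at
    # most `maxqueue` additional callers wait, and anything beyond that is
    # rejected immediately, which is what shows up as "pool counter rejections".
    class PoolCounterSketch:
        def __init__(self, workers, maxqueue):
            self.workers = threading.Semaphore(workers)                # executing
            self.in_flight = threading.Semaphore(workers + maxqueue)   # executing + queued

        def run(self, work):
            # Reject when every execution slot is busy and the queue is full.
            if not self.in_flight.acquire(blocking=False):
                raise RuntimeError("pool counter rejection: too many queued requests")
            try:
                self.workers.acquire()  # queue here until an execution slot frees up
                try:
                    return work()
                finally:
                    self.workers.release()
            finally:
                self.in_flight.release()

    # Invented numbers: a +10% bump on the parallelism cap, e.g. 100 -> 110.
    pool = PoolCounterSketch(workers=110, maxqueue=50)
    result = pool.run(lambda: "pretend this is a query sent to elastic")

The bump raises the parallelism cap, so more queries reach elastic before callers queue or get rejected; if elastic can't absorb them, the extra load shows up as thread pool queueing on the elasticsearch side instead.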