[07:33:27] sweet: https://issues.apache.org/jira/browse/FLINK-21819
[08:58:12] dcausse: T279621
[08:58:13] T279621: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621
[09:01:23] note that swift should now support S3 v4, so we might already be good
[09:03:13] indeed, if the flink s3 client doesn't work with swift's s3 api implementation I'm sure both/either upstream would be interested in making it work
[09:20:26] thanks!
[09:21:11] I'll create a ticket to test & switch to an S3 client for flink
[09:23:57] for sure, feel free to subscribe me and MatthewVernon (new data persistence SRE, being onboarded to swift) to the task
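(Editor's note: Swift's S3 v4 support could be sanity-checked independently of Flink with a few boto3 calls along these lines. This is only a rough sketch; the endpoint, credentials, and bucket name are placeholders, not real configuration.)

    # Minimal sketch: check whether a Swift S3 API endpoint accepts Signature
    # Version 4 requests before pointing the Flink S3 filesystem at it.
    # Endpoint and credentials below are placeholders.
    import boto3
    from botocore.client import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="https://swift.example.org",  # placeholder Swift S3 endpoint
        aws_access_key_id="EXAMPLE_KEY",
        aws_secret_access_key="EXAMPLE_SECRET",
        config=Config(signature_version="s3v4", s3={"addressing_style": "path"}),
    )

    # Round-trip the basic calls checkpointing-style workloads rely on:
    # create a bucket, write an object, read it back, list the bucket.
    s3.create_bucket(Bucket="flink-checkpoints-test")
    s3.put_object(Bucket="flink-checkpoints-test", Key="probe.txt", Body=b"hello")
    print(s3.get_object(Bucket="flink-checkpoints-test", Key="probe.txt")["Body"].read())
    listing = s3.list_objects_v2(Bucket="flink-checkpoints-test")
    print([o["Key"] for o in listing.get("Contents", [])])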
[09:35:14] fun fact of the day: did you know that the python elasticsearch-curator depends on boto, which depends on botocore (AWS libraries), which add 60M to each venv we have it in?
[09:41:57] lunch
[09:42:04] python is still a small player... du -sh ~/.virtualenvs/ -> 2,1G, du -sh ~/.m2/repository/ -> 5,2G :P
[09:42:04] volans: bits are cheap!
[09:42:31] gehel@durin ~> du -sh .m2/repository/
[09:42:32] 7.7G .m2/repository/
[09:42:44] :)
[09:44:07] lol
[10:28:31] lunch
[12:04:49] to be fair, ~/.m2/repository is the aggregate of all the Java projects I have, not a direct comparison to a single venv
[12:15:24] gehel: T289770 Yes, let me try and write something there today
[12:15:25] T289770: Add hints in response headers for 404 responses in Special:EntityData - https://phabricator.wikimedia.org/T289770
[12:16:17] addshore: thanks!
[12:17:20] I have a question back at you! If I were to start using the 0.3.84 wdqs updater code, coming from 0.3.6, any ideas what might change in the rdf output?
[12:17:33] or where to look to see what might change?
[12:17:42] that's a question for dcausse
[12:18:07] addshore: nothing
[12:18:19] regarding the RDF itself I mean
[12:18:37] okay, so the only changes in wdqs version numbers really are the way the internal indexing happens?
[12:19:14] yes, and bugfixes
[12:19:33] I mean at the query side of things
[12:19:47] epic
[12:20:18] since it's a multi-module project the updater might get a version bump even though nothing actually changed there
[12:20:27] yup, niceeeeee
[12:20:43] 0.3.6 is quite old tho :)
[12:20:50] so for this wbstack / wbaas thing we are going to base a "new" updater on 0.3.84 or something, but keep 0.3.6 for the actual query service for now
[12:21:30] ok
[12:21:39] lemme know if you run into troubles
[12:22:28] but generally speaking we never change the RDF output, it's part of the stable interface policy
[12:24:01] ack!
[13:44:22] FYI that the Structured Data team is talking internally about T288230 and the implications. I'll let you know when there's an update
[13:44:23] T288230: Promote MediaInfo RDF format to stable - https://phabricator.wikimedia.org/T288230
[13:54:15] cbogen_: thanks!
[14:58:35] \o
[14:59:01] o/
[15:27:48] dcausse: thanks for the response on the UrisScheme, and sorry for being difficult
[15:28:32] ebernhardson: np! this codebase is a bit of a mess
[15:28:52] i've been trying to not say that... but ya, that's how i feel too :P
[15:32:30] it's a mess, but it's improving!
[15:39:19] i suppose random note... wmde reopened T287563 so i responded with an ask for an exact amount of lag they want to consider a problem, and a reminder that search was never designed for real time
[15:39:20] T287563: slow indexing of new Items on Wikidata? - https://phabricator.wikimedia.org/T287563
[15:40:26] makes sense
[15:41:10] I've looked at the metrics today and did not find anything particularly high (> 10min)
[15:41:20] i think their users consider 30s "high"
[15:41:30] * ebernhardson is probably being a little hyperbolic, but not that much :P
[15:41:34] :)
[15:41:51] but I wonder if (in case of a massive template update) we should not have a prio topic for cirrusElasticaWrite
[15:41:59] i mean, from the end user perspective it makes sense. Open a tab to create an item, notice a missing thing, open a second tab to create that item, create it, and go back to the first
[15:42:37] since all cirrusElasticaWrite jobs are now materialized, as opposed to before when we did synchronous writes from the LinksUpdate job
[15:42:51] i wonder if the better answer isn't something like "query every new page by the user in the last 5 minutes and do a crappy plain-text search"
[15:43:37] a priority elastica write could help, hmm
[15:43:57] i suppose that is a regression, although a bit old now, since priority gets lost at that stage now
[15:44:55] yes, I don't think that would help much in this case, but for the last time, where there seemed to be more backlog due to the sanitizer and non-prio updates, perhaps
[15:44:55] if you monitor the script i wrote (watches recent changes, then pings cirrusdump to wait for new revisions) the lag is typically low, but it goes up and back down in a bit of a sine wave
[15:45:23] search lag monitor: https://phabricator.wikimedia.org/P17040
[15:45:51] but anyways, i do wonder if that sine wave is basically the saneitizer
[15:46:03] (would be easier to tell if it was a graph :)
[15:46:03] the sine wave is likely the sanitizer, yes
[15:46:06] :)
[15:50:24] i suppose i do wonder, the docs say "If a document has been updated but is not yet refreshed, the get API will issue a refresh call in-place to make the document visible." Would that mean running a lag monitor that way would trigger lots of extra refreshes?
[15:50:45] i guess we can turn realtime off, meh... not to figure out now anyways :P
[15:51:18] ah, I hadn't thought about that, perhaps issuing a search on _id is better?
[15:53:16] https://en.wikipedia.org/w/index.php?search=pageid%3A1234&title=Special%3ASearch&go=Go&ns0=1&cirrusDumpResult
[15:53:18] well, they have a realtime=false query param that can be passed to turn that off, i'd have to ponder which way is right for cindy. I also wonder if that causes odd load things in cindy where we update->refresh->update->refresh on almost every other req
[15:53:44] ok
[15:54:01] maybe we just turn realtime off there in general, on first thought it seems ok but i'm not completely sure
[15:54:26] for all the get() ?
[15:54:33] for cirrusdump
[15:54:43] i suppose maybe just the one we call through the API
[15:54:48] prop=cirrusdoc
[15:55:34] I think get() is mostly used for debugging anyways, and since it's accessible from outside I think it's safer to disable realtime indeed
[15:56:01] if it breaks cindy (which is already in bad shape) there could be an option to re-enable that?
[15:57:06] yea, shouldn't be too difficult to let it vary
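(Editor's note: the two options weighed above, the get API with realtime=false versus a search on _id, look roughly like this against a bare Elasticsearch endpoint. Host, index name, and document id are placeholder values for illustration only.)

    # Minimal sketch of the two approaches discussed: a realtime=false GET,
    # which skips the implicit refresh the get API would otherwise trigger for
    # not-yet-refreshed docs, versus a search on _id, which only ever sees
    # already-refreshed segments. Host, index, and id are placeholders.
    import requests

    ES = "http://localhost:9200"
    INDEX = "enwiki_content"   # placeholder index name
    DOC_ID = "1234"            # placeholder document id

    # Option 1: get API with realtime disabled -- no forced refresh.
    r = requests.get(f"{ES}/{INDEX}/_doc/{DOC_ID}", params={"realtime": "false"})
    print(r.json().get("found"))

    # Option 2: a search on _id -- same "no forced refresh" property.
    r = requests.post(
        f"{ES}/{INDEX}/_search",
        json={"query": {"ids": {"values": [DOC_ID]}}, "size": 1},
    )
    print(r.json()["hits"]["total"])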
[17:26:06] finally took a moment to actually look. Turns out we don't use get or mget; Searcher::get issues a search query by id
[17:26:37] so, everything works "as expected" and it doesn't force the refreshes the way upstream get does
[17:43:49] cool, so the numbers you extracted are real
[17:51:19] yea, it would seem so. This has a bit of inefficiency since ApiTrait::loadDocuments does one title at a time, maybe it doesn't matter though. Could probably deploy a prometheus metric for just wikidata without bothering to fix it
[17:54:58] makes sense, I think it's only wikidata that complained about latency
[17:55:02] dinner
[18:33:06] for WDQS codfw, I remember asking why a complex query is able to lock things up if it already got timed out, and I think the answer is that it times out on the user side, but the query continues running anyway. If that's true, why is that the case? Is it possible to kill a query that has timed out for a user and isn't going to return anything anyway? Am I misunderstanding how things work?
[18:34:41] * ryankemper's internet suddenly cut out and isn't coming back
[18:35:10] there's workers doing roofing / other work for the neighbors and it's probably completely unrelated, but I always get paranoid that they somehow severed the line or something
[18:35:32] mpham: so there's a lot we don't understand, but at a really high level blazegraph itself is getting locked up and rendered unresponsive
[18:36:27] so there theoretically isn't really a way to "tell" blazegraph to stop, because it's not listening (besides restarting the service entirely)
[18:36:46] gotcha
[18:37:17] i remember that we don't log queries until they run, but is there a way to keep track of them in such a way that we don't rerun queries that we know have locked up blazegraph previously?
[18:38:06] like, essentially have a 'do not run' list of queries
[18:38:31] So yeah, from the user's perspective the frontend (nginx) will not hear back from blazegraph and should presumably time out at that level of the stack, so nginx should tell the user their query timed out
[18:38:50] mpham: I don't think we can really "know" which query actually did it
[18:39:09] Like we could presumably have a list of all queries running at that time, which would narrow it down
[18:39:46] But because the query in question is actually toppling blazegraph, the introspection we can do is pretty limited
[18:40:14] I guess the other issue is that even if we banned specific queries, a slight alteration to the query wouldn't get banned but would still cause a similar problem
[18:40:27] good point
[18:41:06] And I guess finally we're probably not even technically sure if a given query will always lead to lockup, or if it's dependent on other context (other queries running, general load, etc)
[18:41:54] in terms of identifying the query, in my head it would be something like: WDQS knows it runs query A, and then knows if query A finishes and returns results successfully. So if it knows it started B but never finished it, we'd know B is a bad query -- i'm not an expert here though
[18:42:27] I guess one confound is it probably also started running queries C, D, E, F, and G around the same time
[18:42:40] So we'd at least have a list of candidate suspects, which is way better than nothing
[18:42:40] i see
[18:42:48] But probably no single clear smoking gun
[18:43:47] sounds like it's a bit messy to approach the problem from that direction then
[18:43:50] we could try things like taking a server out of the pool and sequentially replaying queries from that time period back at it. No clue if it would work, but if you have a killer query it might (but it might also take days to run through and find nothing)
[18:44:07] i suppose we don't have logs of the incoming queries though, right? We only log after execution completes?
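(Editor's note: the replay idea mentioned at [18:43:50] could be prototyped along these lines. The endpoint, timeout, and input format (one SPARQL query per line) are assumptions, and a real run would want rate limiting and better bookkeeping.)

    # Rough sketch of "replay logged queries at a depooled host": send each
    # query in turn, flag anything that times out or errors, and stop if the
    # server stops answering at all. Host, port, input format, and timeout
    # values are assumptions for illustration only.
    import sys
    import requests

    DEPOOLED_HOST = "https://wdqs-test.example.org/sparql"  # placeholder endpoint
    TIMEOUT = 65  # a bit above the 60s blazegraph query timeout

    def replay(path):
        with open(path) as f:
            for lineno, query in enumerate(f, 1):
                query = query.strip()
                if not query:
                    continue
                try:
                    r = requests.post(
                        DEPOOLED_HOST,
                        data={"query": query},
                        headers={"Accept": "application/sparql-results+json"},
                        timeout=TIMEOUT,
                    )
                    status = r.status_code
                except requests.Timeout:
                    status = "timeout"
                except requests.ConnectionError:
                    print(f"line {lineno}: server stopped responding, candidate killer query?")
                    break
                print(f"line {lineno}: {status}")

    if __name__ == "__main__":
        replay(sys.argv[1])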
[18:45:43] if it was one bad query continuously breaking things, it might show up at each of the lockup events. if each lockup event had different queries being run, that might tell us it's not a single smoking gun: either multiple bad queries and/or a combination of factors is more likely
[18:48:55] ebernhardson / mpham: I don't remember off the top of my head when we log queries... if we do log only after execution, why would that be?
[18:50:00] ryankemper: i don't actually know how wdqs works, but in cirrus we log after the query has completed so we can include information about the result. It seems like a common pattern
[18:50:16] things like how long it took
[18:50:20] Right, makes sense (altho kinda interesting to not log on both ends)
[18:50:52] Tethered to my phone but it's very finicky so taking forever to load stuff, bleh
[18:54:18] mpham: the usual issue with finding killer queries or expensive queries is that once the server is overloaded, all queries are slow.
[18:55:25] There is a timeout configured in blazegraph to kill queries over 60 seconds, but I suspect the implementation is problematic and still consumes resources after those 60 seconds. Timeouts are notoriously hard to implement correctly in non-trivial scenarios.
[18:55:40] And Blazegraph isn't the cleanest code I've ever seen :/
[18:56:34] I think we do log timed-out queries in the web request logs, but we only log GET parameters. Large queries often use POST, which we don't log.
[18:57:05] We had a few ideas of things we can do to improve investigation, but they require writing a non-trivial amount of code.
[18:58:20] For example, tracing threads to their queries, so that we could log how expensive a query is in terms of number of threads. For the moment, we measure cost only in terms of clock time, which is a very imperfect approximation for a multithreaded system
[19:42:44] i'm guessing this non-trivial amount of code would be pretty blazegraph specific?
[23:00:10] ryankemper: one patch if you have time later, it's httpd config for the microsite serving wdqs and soon wcqs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/714624
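(Editor's note: tying together the "candidate suspects" idea and the point at [18:56:34] that GET queries do show up in the web request logs, a crude scan of those logs for a lockup window might look something like this. The log format (nginx combined-style, URL-encoded query in the request line) and the invocation are assumptions, not the actual WMF log layout.)

    # Sketch: pull candidate SPARQL queries out of web request logs for a given
    # lockup window. Log format and field layout are assumptions.
    import re
    import sys
    from datetime import datetime
    from urllib.parse import urlparse, parse_qs

    # e.g. 10.0.0.1 - - [30/Aug/2021:18:33:06 +0000] "GET /sparql?query=... HTTP/1.1" 200 ...
    LINE_RE = re.compile(r'\[(?P<ts>[^\]\s]+)[^\]]*\] "(?P<method>\S+) (?P<url>\S+)')

    def candidates(log_path, start, end):
        """Yield (timestamp, query) for GET /sparql requests between start and end."""
        with open(log_path) as f:
            for line in f:
                m = LINE_RE.search(line)
                if not m or m.group("method") != "GET":
                    continue
                ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S")
                if not (start <= ts <= end):
                    continue
                url = urlparse(m.group("url"))
                if not url.path.endswith("/sparql"):
                    continue
                for q in parse_qs(url.query).get("query", []):
                    yield ts, q

    if __name__ == "__main__":
        start = datetime.fromisoformat(sys.argv[2])  # e.g. 2021-08-30T18:25:00
        end = datetime.fromisoformat(sys.argv[3])
        for ts, q in candidates(sys.argv[1], start, end):
            print(ts, q[:120].replace("\n", " "))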