[00:30:41] meh, that's annoying. phabricator decided that 9795678 is actually a magic number, since Diffusion has a git commit with the exact same prefix and decides to auto-format it as a link
[00:30:55] * ebernhardson wraps in `...` and calls it good enough, but a bit annoying. And perhaps lucky :)
[08:40:09] That hammering from AWS seems like a recurring problem (we had the same on WDQS). I'm wondering if we should have a higher level solution.
[08:40:57] We can obviously do some filtering / throttling / ... in our application (Cirrus, W[CD]QS, ...), but it feels like we should, at least to some extent, deal with this at the traffic layer.
[08:41:01] I think the SREs have requestctl, a way to configure some rules
[08:42:34] the hard part might be to identify search request URLs, we have so many different ways to search
[08:43:08] that might be the tool we use, but should we have a strategy as well? Do we want to throttle AWS in general? Only for things that seem expensive (Search, WDQS)?
[08:43:26] Yeah, if only we had a reasonable API strategy :)
[08:43:52] I'll at least open a phab task. Probably something to discuss at some point with traffic, or with Mark
[08:44:34] or reuse T326757 ?
[08:44:35] T326757: Investigate doubling of full_text search query rate since jan 1, 2023 - https://phabricator.wikimedia.org/T326757
[08:45:41] I'll link that ticket. But I'd like to trigger a conversation that isn't specific to Search
[09:00:58] created T326782
[09:00:58] T326782: Generic strategy to deal with high volume / expensive traffic from cloud providers - https://phabricator.wikimedia.org/T326782
[09:01:04] we'll see where this goes
[10:39:35] Errand, back in a few
[10:54:00] lunch
[11:17:17] Lunch
[14:06:27] o/
[14:09:03] oh boy, more cloud users!
[14:14:08] Looks like I need to learn how to use hive
[14:48:41] errand
[16:00:20] \o
[16:01:24] o/
[16:04:34] doh, thought i pushed this patch up before finishing yesterday: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/879565
[16:04:53] dcausse: could you have a look? Hoping to deploy it later today to throttle our friend at aws
[16:06:20] sure
[16:10:46] \o
[16:14:06] g'mornin
[16:19:58] huh, i wonder if this actually worked. The cirrus import i started via spark3 finished in 6h27m
[16:20:29] that'd make sense
[16:20:59] if hive partition creation is almost instant
[16:21:03] yea it looks reasonable so far. I'll do a little analysis to compare it to the other import, but the only thing i had to do was swap spark-submit for the v3
[16:22:15] patch looks good to me, will wait for jenkins and hit +2
[16:22:36] thx
[16:25:11] mjolnir feature collection seems to have worked now too, with the updated plugin. suspicious that multiple things seem to be working :P
[16:26:47] :)
[16:26:56] regarding automated pool counter
[16:27:17] not new but we're not doing this detection on the completion suggester
[16:27:49] most problematic one is the "Search" pool
[16:29:32] hmm, yea this will only affect things using Searcher::{get, searchMulti}
[16:29:53] phpcs is complaining :/
[16:30:16] doh, i guess i didn't run that part and only ran phpunit locally. I should better remember to run the other commands...sec
[16:32:14] for completion search, we do see that pool regularly dropping some requests as well. Don't know that it's related to automation requests, but it would probably be reasonable to have it use this as well. Maybe i can pull this impl up into ElasticsearchIntermediary, will check the uses
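For context on the pool counter mechanism being discussed: it caps concurrent work per named pool and rejects new requests once both the worker slots and the wait queue are full. The Python below is only a toy sketch of that idea, not the MediaWiki PoolCounter client or any CirrusSearch code; every name and number in it is made up.

```python
# Toy sketch of the pool-counter idea only: a named pool allows `workers`
# concurrent executions, lets up to `maxqueue` callers wait for a free slot,
# and rejects everything beyond that.
import threading


class PoolFullError(Exception):
    """Raised when both the worker slots and the wait queue are exhausted."""


class Pool:
    def __init__(self, workers: int, maxqueue: int):
        self.workers = workers      # concurrent executions allowed
        self.maxqueue = maxqueue    # callers allowed to wait for a free slot
        self._active = 0
        self._waiting = 0
        self._cond = threading.Condition()

    def run(self, work):
        with self._cond:
            if self._active >= self.workers:
                if self._waiting >= self.maxqueue:
                    # Pool saturated: reject rather than piling up more load.
                    raise PoolFullError("pool saturated, rejecting request")
                self._waiting += 1
                while self._active >= self.workers:
                    self._cond.wait()
                self._waiting -= 1
            self._active += 1
        try:
            return work()
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify()


# A stricter pool for traffic classified as automated (numbers are made up):
automated_pool = Pool(workers=10, maxqueue=10)
# automated_pool.run(lambda: do_search(query))  # do_search is hypothetical
```

The point of a separate "automated" pool is that bot-like traffic gets a much smaller budget than interactive searches, so it is rejected first when the cluster comes under load.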
[16:33:14] might make sense to pull it up but probably in another patch, I feel that the rejections there are mostly a consequence of the increased load on fulltext
[16:33:41] yea seems plausible, the cluster generally slows down under load
[17:01:30] ebernhardson: perhaps a good time to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/862343 as well?
[17:01:58] oh! seems i completely forgot about that one. Sure, i'll put it into the deploy today
[17:05:46] hmm, that pushes the deploy to 7 patches with the normal limit at 6. I've put it in there but it might get bumped
[17:05:55] :(
[17:06:35] it's 6 config patches and just our 1 mw patch, so maybe it will fit
[17:12:48] After years of work here is my "report" on my experiments with Blazegraph and friends https://harej.co/posts/2023/01/loading-wikidata-into-different-graph-databases-blazegraph-qlever/
[17:13:03] I want to test more Blazegraph alternatives so if there are any your contractor is looking into I'd be happy to tag-team with them
[17:13:47] hare: thanks for sharing!!
[17:14:00] (and writing!) :)
[17:14:08] Regarding Blazegraph I don't think it tells you anything you don't already know :]
[17:15:11] as for alternatives, I think we're leaning toward Jena and splitting into subgraphs
[17:15:24] Interesting! How do you split out the subgraphs?
[17:16:33] that's still to be defined but most probably based on the analysis made by Aisha (https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Query_Analysis)
[17:17:02] have to go, back later tonight
[17:19:40] great article, thanks hare
[17:31:14] ebernhardson: if they don't get to that 7th patch but there's still time left in the backport window I can roll the deploy for it, since it should only take a couple mins
[17:33:25] hare: I got a good (perhaps unintended) chuckle out of `Figuring it would be as simple as deploying my own instance of Blazegraph, loading the Wikidata TTL dump, and setting the query timeout limit higher, I set out to build my own Wikidata Query Service.` :P
[17:34:12] Well you get to read how my naïveté was rewarded :D
[17:37:12] I believe it was Aristotle who said "there are three hard problems in computer science: naming things, cache invalidation, deploying massive graph databases, and off-by-1 errors"
[17:37:25] Come to think, that might have been Abraham Lincoln
[17:43:43] the reload failed again on 2009, looks like it fails after the munging step. I added some code for "--reuse-munge", the flag was there before but didn't do anything https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/876217
[17:54:27] lunch, back in ~1h
[17:54:43] :(
[17:57:23] hare: nice writing! Thanks!
[17:57:56] It is somewhat nice to see that others struggle as much as we do in making an RDF backend scale to Wikidata size...
[17:59:41] inflatador: fwiw afaict the reuse-munge flag did work properly in the master branch, the issue is that the `fetch_dumps` method no longer exists in the same form in the nfs patch and that was where the flag was getting used
[18:00:12] i.e. previously `fetch_dumps` would `return` early if that flag was set
[18:08:19] well, sure, it is necessarily going to be harder if you have 1% of the resources of the other people doing it :D
[18:09:11] but it's not just you, it's mysterious and frustrating
[18:20:33] inflatador: I pushed a bunch of cleanup to https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/876217 jfyi. Also I think we need to think more on the kafka automated timestamp extraction; I think if we reuse the munge but meanwhile there is a newer dump then we're going to end up with the wrong consumer offset
[18:21:16] Like I think if we reuse the munge we will need to have previously saved the kafka timestamp that we observed when the munging was initially performed
[18:21:45] Issue would only crop up if the dumps have changed between the munge run and the subsequent no-munge run ofc
[18:22:10] hmm, yea sounds like the kafka bit needs to be written out into a file next to the munge to be reused
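A minimal sketch of the idea above, assuming nothing about the actual cookbook code: persist the Kafka timestamp observed when the munge was produced, so a later --reuse-munge run still computes the right consumer offset even if a newer dump has appeared in the meantime. The file path and function names below are placeholders.

```python
# Hypothetical sketch, not the real reload cookbook: record the Kafka
# timestamp next to the munged output at munge time, and read it back when
# the munge is reused instead of deriving a fresh timestamp from a newer dump.
import json
from pathlib import Path

# Placeholder location, assumed to live next to the munged dump output.
TIMESTAMP_FILE = Path("/srv/wdqs/munged/kafka_timestamp.json")


def save_observed_timestamp(timestamp_ms: int) -> None:
    """Called right after munging: persist the timestamp used for Kafka offsets."""
    TIMESTAMP_FILE.write_text(json.dumps({"kafka_timestamp_ms": timestamp_ms}))


def load_observed_timestamp() -> int:
    """Called when --reuse-munge is passed: reuse the previously saved timestamp."""
    return json.loads(TIMESTAMP_FILE.read_text())["kafka_timestamp_ms"]
```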
[18:57:47] ebernhardson: do you know if the additional traffic we're getting from AWS has any specific patterns? Same kind of queries? Similar user agent? Any idea what this traffic is?
[19:07:13] back
[19:08:32] gehel: i haven't looked too closely, a few sample requests i looked at were multi-word queries. I can pull some more specific info
[19:10:04] Don't spend time on this. It's mostly just me being curious
[19:11:49] ryankemper ACK, definitely need to think about this more
[19:12:44] when i looked earlier the user agent was some python client library, but i didn't check much to verify that was consistent
[19:13:09] inflatador: ebernhardson's solution should be all we need. let's write a `/srv/wdqs/last_observed_timestamp` file or similar, and if the --reuse-munge flag is provided it should read that file to get the timestamp
[19:13:24] sounds reasonable
[19:13:43] re: user agent/bot abuse I'd like to get into that at the pairing session
[19:17:04] gehel: was quick to grab a sample of 1k reqs, user-agent is python-urllib3/1.26.13 pretty consistently. queries are all lowercased and multi-word but not sure how i would classify them. they are things like `fm's the left bank show`, `an ordinary murder theatre`, `aluminum industry standard`, `primetime emmy award outstanding variety series`, etc.
[19:17:36] but that could just be the one sample, i grabbed a set of sequential queries from the same ip that was reported as one of the top service users
[19:29:10] do you grab the samples from spark or kibana?
[19:31:46] from spark, via a jupyter notebook. I already had a notebook that had most of the code, just had to plug in a bit to select a single hour and grab a sample of 1k req's from a specific ip
[19:32:09] didn't think to try kibana, i suppose relforge instances have the queries imported, might be able to get something there
[19:34:43] ebernhardson can you do this with kafka?
[19:34:58] like reading the kafka streams? Also we're in Guillaume's room if you wanna join for pairing
[19:36:51] hmm, plausibly could do it from kafka streams. The spark data i used is a direct import of the kafka streams into hadoop
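Roughly the shape of the notebook query described above: filter to a single hour's partition, restrict to the reported IP, and take about 1k rows. The table and column names here are placeholders, not the real CirrusSearch request schema.

```python
# Illustrative pyspark sketch only; table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

reqs = spark.table("cirrus_requests")  # placeholder table name
sample = (
    reqs
    .where((F.col("year") == 2023) & (F.col("month") == 1)
           & (F.col("day") == 12) & (F.col("hour") == 18))  # a single hour
    .where(F.col("client_ip") == "203.0.113.42")            # placeholder IP
    .select("user_agent", "query")                          # assumed fields
    .limit(1000)                                            # ~1k requests
)
sample.show(20, truncate=False)
```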
[22:12:00] patches shipped, can see plenty of requests being rejected in the `automated` pool counter (as expected) and cluster load dropping. We also shipped the bit to disable real-time incoming link counting, which again reduces cluster load
[22:12:33] (incoming link counting in a batch job was shipped before i went on vacation)
[22:58:04] /moti happy2
[22:58:14] (。◕‿◕。)
[23:39:47] hmm, i'm not sure what changed but at 22:00 we started getting lots of logs for `Skipping a page/revision update for revision {rev} because a new one is available`
[23:40:21] doesn't quite line up with the wmf.18 deploy at 19:11
[23:41:17] 22:00 is when the patch turning off incoming link counting jobs was deployed, but it's not clear that would have done it
[23:42:17] oh, nevermind, those aren't new at 22:00. 22:00 logging blew up because that's the 'Automated' pool counter failures; the 'new one available' messages are reasonably consistent