[00:25:38] inflatador: (for tmrw) do we know what stage it failed at? one thing that occurred to me is I think we're extracting the kafka timestamp too early
[00:27:05] inflatador: see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/873782 / https://phabricator.wikimedia.org/T325114#8492035
[00:27:32] that being said, that may just be orthogonal to the actual problem breaking the reload that you're/we're seeing, just something I noticed a week or two ago that might be a problem
[10:37:17] lunch+errand, back around 3pm CET
[10:56:09] lunch
[13:59:32] o/
[15:07:55] dcausse since the DSE cluster has flink operator now, should we try to deploy rdf-streaming-updater there? Was thinking that would be the next step but LMK if not
[15:11:17] inflatador: hm... that'd be for testing only I think, i.e. using the same testing conf we have on k8s-staging@wikikube
[15:12:23] dcausse ACK, should we refactor the updater's chart to use Flink Operator?
[15:12:34] but first we need to do T289836
[15:12:35] T289836: Upgrade the WDQS streaming updater to latest flink (1.15) - https://phabricator.wikimedia.org/T289836
[15:13:03] inflatador: not sure if refactoring is the right approach, a new chart might be better
[15:13:18] (but I don't know)
[15:13:50] there are a few open questions to discuss as well regarding namespaces
[15:14:27] the question is in the ticket here: T326409
[15:14:27] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409
[15:14:37] "What namespace strategy should we use for flink jobs? A single one for all wmf flink jobs, per team, per project?"
[15:20:04] inflatador: I'll get started on migrating the code to flink 1.15, if you want you could get started on the chart?
[15:23:11] dcausse sounds like a plan. Are we confident the latest flink image (https://docker-registry.wikimedia.org/flink/tags/) will work in production, or do we need to do any other testing?
[15:23:43] inflatador: we'll have to build our own image
[15:23:55] reusing what Andrew has done
[15:24:13] the application will be bundled into the flink docker image
[15:26:45] Ah, OK, gotcha. I'll read thru the tickets and work on the chart
[16:00:56] Search Platform office hours happening now: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
[16:02:41] o/ am still testing things now. flink k8s operator is deployed in dse-k8s-eqiad, but i have not yet deployed a flink app using it.
[16:02:48] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878210
[16:03:27] gabriele is also working on building the flink image, will cc you in slack thread
[16:03:27] https://phabricator.wikimedia.org/T326731
[16:07:56] <3
[16:18:02] appt’s getting delayed, will be around in about an hr
[17:04:07] workout, back in ~40
[17:50:16] taking my dad out to lunch soon, so will not make the unmeeting
[18:12:07] lunch, back in ~1 h
[18:14:24] dinner
[18:28:23] kicked off the cirrus index import via spark3 to see how it's going to work out; this doesn't have any specific python dependencies so it's as simple as using the spark3 version of spark-submit
[18:43:19] I am writing a blog post about my experiences with Blazegraph and other graph DBs. Anyone want to fact check this paragraph:
[18:43:25] "As Blazegraph is mostly abandonware at this point, it is basically a black box that you reboot when it falls over and rebuild when it fails completely. (I have had to do one such rebuild.) Loads from scratch can take several days; originally there were timeouts, but the time limits were simply removed, which only papered over the issue. As mentioned before, queries can take exceptionally long to execute, or never execute at all.
[18:43:26] Certain queries with large intermediate result sets result in out-of-memory errors, even on generously resourced workstations."
[18:43:58] sadly, that seems accurate to me
[18:44:28] i don't know that we have problems with queries not executing at all, but doesn't mean it doesn't happen
[18:45:17] I have had experience with OOM queries even with very generous limits
[18:46:08] Either because something spiraled out of control or there was an intermediate result set that was too big
[18:46:13] yea, sounds like something that blazegraph would do
[18:57:28] also, is it correct to say definitively that Blazegraph itself does not have result pagination?
[18:57:46] or does this feature exist and I have just never used it in 7 years
[18:58:33] hmm, sadly i don't know. It's plausible there is some sort of way to keep a handle open and then paginate the handle, but i've not seen it
[19:41:40] reading this comment from Nik I think it might support that but I don't think we ever tested/exposed this feature (https://github.com/elastic/elasticsearch/issues/12188)
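For the pagination question above: whatever Blazegraph may or may not expose internally, clients usually approximate pagination against a SPARQL endpoint with ORDER BY plus LIMIT/OFFSET, which re-runs the query for each page rather than keeping a handle open. A minimal sketch using Python's SPARQLWrapper against the public WDQS endpoint; the endpoint URL, user agent, and query are illustrative examples, not anything from the discussion:

```python
# Client-side pagination sketch: ORDER BY + LIMIT/OFFSET, re-issued per page.
# Endpoint, user agent, and query are illustrative examples only.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 1000

QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?item
LIMIT %d
OFFSET %d
"""

def fetch_all():
    sparql = SPARQLWrapper(ENDPOINT, agent="pagination-sketch/0.1 (example)")
    sparql.setReturnFormat(JSON)
    offset = 0
    while True:
        sparql.setQuery(QUERY % (PAGE_SIZE, offset))
        bindings = sparql.query().convert()["results"]["bindings"]
        if not bindings:
            break
        yield from bindings
        offset += PAGE_SIZE

for row in fetch_all():
    pass  # each row is a dict of variable -> {"type": ..., "value": ...}
```

Note the caveat relevant to the thread: each page re-executes the full query server-side, so this does not help with queries whose intermediate results already blow the memory or time budget.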
[19:42:06] hmm, the 2-year graph of production-search-eqiad thread pool usage is a bit concerning. oct-nov 2021 was 300-400 threads. jan 2022-dec 2022 was pretty consistently in the 400-500 thread range. something changed jan 1 2023; we've been at 800-1200 since
[19:42:21] dcausse: the open handle bit was wrt blazegraph
[19:42:32] this comment as well :)
[19:42:40] hmm, but you linked elasticsearch?
[19:42:50] oh, i see it's about blazegraph. i should read :P
[19:42:54] yes :)
[19:45:55] our high loads graph has also been going wonky since jan 1 :S looks like it's due to an increase in full text qps. dec 2022 we were doing 200-450qps, after jan 1 it's 450-700qps :S
[19:46:42] yes it doubled :(
[19:47:08] likely a single client
[19:48:28] i suppose i mentioned earlier doing some analysis of the cirrus request logs in hive... i guess i should get to that then and figure out if this can be attributed to some grouping of clients
[19:49:12] makes sense
[20:10:47] back
[21:51:53] * ebernhardson didn't expect a simple group-by client_ip on 600M full_text search requests to take 20m...
[22:00:24] that sounds... suboptimal
[22:01:01] inflatador: finishing up lunch, ~6 mins for pairing
[22:01:28] ACK
[22:02:57] Thinking we should look at https://phabricator.wikimedia.org/T326409 . It's a LOT to digest (and I've forgotten half of it over the holidays ;( ) but some notes here: https://etherpad.wikimedia.org/p/rdf-flink-k8s
[22:08:01] and the top reqs by day/ip gives... 2M reqs per day from multiple ips that resolve to *.amazonaws.com
[23:11:31] so, basically yea, something on aws is hammering us. Requests from non-public clouds are reasonably consistent at ~35M/day. Reqs from AWS subnets increased from 1.5M/day to ~25M/day
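As a rough illustration of the hive-side analysis described above, the per-day/per-client breakdown might look something like the PySpark sketch below. The table name, column names, and the full_text filter are assumptions for illustration, not the actual cirrus request log schema:

```python
# Illustrative PySpark sketch of a per-day/per-client_ip breakdown of
# full_text search requests. Table name, column names, and the query_type
# filter are assumptions, not the real cirrus request log schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cirrus-fulltext-by-client").getOrCreate()

reqs = spark.table("event.mediawiki_cirrussearch_request")  # assumed table name

top_clients = (
    reqs
    .where(F.col("query_type") == "full_text")          # assumed column
    .groupBy(F.to_date("dt").alias("day"),               # assumed timestamp column
             F.col("client_ip"))                         # assumed column
    .agg(F.count(F.lit(1)).alias("requests"))
    .orderBy(F.col("day"), F.desc("requests"))
)

top_clients.show(100, truncate=False)
```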
[23:15:39] ebernhardson: What are we thinking mitigation-wise? Do we look into splitting the search poolcounter into public cloud vs not, or alternatively just hope that the influx is coming from a small enough set of IPs that it's feasible to ban them?
[23:17:21] ryankemper: there is already a split in the poolcounter i made a while ago, but currently it requires us to provide user-agent regexes. I'm not 100% sure, but it looks like the new bits in varnish since then apply an `X-Public-Cloud: aws` header to requests coming from aws (and similarly for gcp, linode, etc.). Going to add a bit that allows us to configure a header that, when present, means we
[23:17:22] consider it an automated request and shuffle it into that bucket
[23:17:43] but i don't know if i have a great way to find out if those headers actually make it to mediawiki, or if it only somehow exists at the varnish layer or some such
[23:19:31] i don't see mention of X-Public-Cloud in anything other than varnish stuff, so no one else is using it that way yet
[23:20:28] well, i guess conftool has some stuff but that's also varnish support stuff afaict
[23:28:36] hmm, tcpdump'ing some requests on wdqs1004 gets the `x-public-cloud: aws` header, so i suppose we can guess that if they make it there they also make it into mediawiki
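To make the header idea above concrete, here is a rough sketch of the bucketing logic in Python. The real change would live in MediaWiki/CirrusSearch configuration (PHP), and the header name, bucket names, and user-agent regexes below are illustrative assumptions, not the actual config:

```python
# Illustrative sketch of the poolcounter bucketing described above.
# The actual implementation would be MediaWiki/CirrusSearch configuration;
# the header name, bucket names, and user-agent regexes here are made up.
import re

AUTOMATED_HEADER = "x-public-cloud"             # assumed configurable header name
UA_REGEXES = [r"python-requests", r"\bcurl\b"]  # example regexes, not the real list

def poolcounter_bucket(headers: dict, user_agent: str) -> str:
    """Pick a poolcounter bucket for an incoming search request."""
    normalized = {k.lower() for k in headers}
    # New bit: a CDN-applied header (e.g. X-Public-Cloud: aws) marks the
    # request as coming from a public cloud, so treat it as automated.
    if AUTOMATED_HEADER in normalized:
        return "cirrus-search-automated"
    # Existing behaviour: fall back to user-agent regex matching.
    if any(re.search(p, user_agent or "", re.I) for p in UA_REGEXES):
        return "cirrus-search-automated"
    return "cirrus-search-default"

# Example: a request tagged by the CDN as coming from AWS.
print(poolcounter_bucket({"X-Public-Cloud": "aws"}, "SomeBot/1.0"))  # cirrus-search-automated
```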