[07:52:23] o/
[07:52:46] need to do a school run, then I'll take a look at p95 query latency
[07:52:56] we had a couple more spikes over the weekend
[08:01:41] o/
[08:09:23] wondering if it's related to the 3 nodes removed from eqiad by Brian on Feb 13, latency issues appear at 19:00 when usage is at its highest
[09:12:36] dcausse ah! that could explain it
[09:28:13] dcausse seems to me that only three hosts are showing high latencies https://grafana.wikimedia.org/goto/mnreK55Ng?orgId=1
[09:30:22] gmodena: yes, saw this one but unsure how to make sense of it. If you reload (I changed the tooltip on the p99 graph to sort by latency) you'll see that this is for the psi cluster, but the user-facing latency issues are on the chi cluster
[09:30:42] ah!
[09:30:54] did not realize that
[09:31:46] on https://grafana-rw.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles if you select the psi cluster it's not clear that there are latency issues visible from mw...
[09:32:31] but could be related... hard to say
[09:34:51] could it be a plotting artifact?
[09:35:25] I zoomed into the spike on Elasticsearch Percentiles, and I see the same timeseries regardless of which cluster I plot from
[09:35:27] checking
[09:35:43] gmodena: yes, sorry, this dashboard is hard to read
[09:36:37] the first 4 rows (percentiles) are measuring the latency between mw and elastic and this includes all search clusters
[09:37:24] the 5th one, "Elastic Thread Pools", is per cluster
[09:37:48] you'll see the thread pool queue going up only on chi
[09:40:01] chi is labelled "production-search", right?
[09:40:02] ah but since we do cross-cluster searches, if say a search query -> enwiki is hitting a slow cross-cluster search on psi, I suppose it could possibly lead to the thread pool on chi going up...
[09:40:06] gmodena: yes
[09:40:26] ok, then I'm looking at the right thing
[09:41:35] gmodena: so yes you might be right, a slow psi node might possibly cause chi search thread pools to fill up...
[09:43:30] looking at previous days to see if these psi nodes were behaving the same way
[09:50:18] I'm also testing the mjolnir patch you reviewed on Friday. I think the task dependencies look fine (compared to previous runs: https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?dag_run_id=scheduled__2025-01-17T18%3A42%3A00.449096%2B00%3A00&tab=graph). I'll babysit the process today
[09:51:57] thanks!
[09:58:22] I don't see the same behavior from one of these slow psi nodes (psi elastic@1086) on a previous instance of these latency issues
[10:02:58] :|
[10:09:25] I don't see any weird pattern in QPS per type of search query...
[10:14:29] does 'Elasticsearch - Mjolnir msearch' report the latency of MLR-backed searches?
[10:15:43] no, this dashboard should be monitoring the queries sent by mjolnir to the search cluster while doing feature collection, not user searches using MLR models
[10:17:19] ack
[10:23:27] errand+lunch
[11:12:25] lunch
[12:36:23] Someone1337: it **should** work, not sure what the issue is. Opening a Phabricator task is probably the best way to get our attention (unless dcausse has an idea).
[12:59:03] gehel can I "Create a generic task" as if it were a StackOverflow question? Or is it a "software bug report"? I've never used Phabricator before
[13:02:50] Someone1337: a generic task is good enough. Tag it with [Discovery-Search] and I'll try to route it somewhere.
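A note on the thread-pool observations earlier in the log: besides the Grafana panels, per-node search thread pool pressure can be read straight from a cluster's cat API, assuming you can reach its HTTP endpoint. A minimal Scala sketch; the endpoint below is a placeholder, not a real chi/psi address.

    import scala.io.Source

    // Hypothetical helper: dump per-node search thread pool stats (active, queue, rejected),
    // sorted by queue depth. The _cat/thread_pool API exists on both Elasticsearch and OpenSearch.
    def searchThreadPool(endpoint: String): String =
      Source.fromURL(
        s"$endpoint/_cat/thread_pool/search?v&h=node_name,active,queue,rejected&s=queue:desc"
      ).mkString

    println(searchThreadPool("http://localhost:9200")) // placeholder endpoint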
[13:09:43] dcausse could we maybe use 5 mins at triage to sync on paused/failed dags, and the state of wikidata dumps? I'm afraid I lost track of a couple of things :(
[13:10:04] gmodena: sure
[13:10:14] thanks!
[13:38:14] Errand, back in 20'
[14:27:53] building the 1.4.2 release of wmf-eventutilities to get rate-limiting support in the spark kafka writer
[14:29:22] dcausse ack
[14:30:38] I hit jar hell trying to run opensearch locally
[14:30:40] org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to load plugin ingest-user-agent due to jar hell
[14:31:02] does this ring a bell? I'm using the 1.3.20 docker image
[14:32:49] docker was also nagging about the container kernel not being compiled with SECCOMP support, so I had to switch off syscall filtering
[14:33:29] the elastic image works fine though
[14:33:42] gmodena: you use the cirrus dev image?
[14:34:10] dcausse I'm using docker-registry.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image:v1.3.20-0
[14:34:35] weird... I'm using the same...
[14:35:49] there's a comment from kostajh in the mw recipe about using an ad hoc ES image on Apple Silicon (arm64)
[14:36:10] I'm on silicon too, maybe I hit some platform kink
[14:36:38] if I force platform: linux/amd64, the JVM segfaults at boot
[14:36:42] that's... not great.
[14:36:47] gmodena: yes I had to use a custom image for that reason.
[14:37:16] kostajh ack
[14:37:21] fun times :)
[14:39:24] gmodena: what if you try to build the image directly from https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image
[14:40:15] my understanding is that it might build it for the right platform?
[14:40:49] attempted to make a cross-platform build setup at https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/3 but it does not quite work within our CI
[14:40:56] and blubber
[14:49:23] dcausse I forked the repo to experiment a bit with it
[14:49:48] for now, if I need opensearch, I'll spin it up on a linux box
[14:53:45] wikimedia-event-utilities-maven-release build fails :(
[14:54:20] oof :(
[14:55:22] seems like a test being a bit flaky, looking
[14:59:50] ack
[15:00:37] re opensearch: building locally did the trick. The container starts, now let's see how stable it is.
[15:11:58] gmodena: if/when you have a sec https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1120189
[15:18:10] dcausse done
[15:18:18] thanks! :)
[15:18:40] np
[15:19:38] re import_wikidata_ttl dag issues, I believe that the wikidata RDF dumps are simply not available (not even produced for 20250203 & 20250210)
[15:23:47] yep; datasets are not there
[15:24:24] subgraph_ is also waiting on those
[15:25:31] maybe we should pause?
[15:27:06] gmodena: dumps can be generated a posteriori so these runs will have to fail I'm afraid
[15:27:13] s/can/can't/
[15:33:18] gmodena: had to relax that same test again :(
[15:33:25] https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1120189
[15:36:38] done. +1 on maybe re-thinking this test
[15:36:47] yes...
[15:36:50] timing things right is bound to be iffy
[15:37:14] re wikidata dumps https://phabricator.wikimedia.org/T384625
[15:37:47] and https://phabricator.wikimedia.org/T386401
[15:46:18] thanks!
[16:01:21] dcausse: we're in https://meet.google.com/eki-rafx-cxi
[16:01:25] oops
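A note on the rate-limiting work mentioned at 14:27: the sketch below is not the actual wmf-eventutilities API, only an illustration of the general technique of throttling a Spark-to-Kafka writer with Guava's RateLimiter. The names writeThrottled and maxRecordsPerSec, and the JSON/string serialization, are illustrative choices.

    import java.util.Properties
    import com.google.common.util.concurrent.RateLimiter
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.DataFrame

    // Cap the per-partition produce rate so a large backfill cannot overwhelm the Kafka brokers.
    def writeThrottled(df: DataFrame, topic: String, bootstrap: String, maxRecordsPerSec: Double): Unit =
      df.toJSON.rdd.foreachPartition { rows =>
        val props = new Properties()
        props.put("bootstrap.servers", bootstrap)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        val limiter = RateLimiter.create(maxRecordsPerSec) // Guava 16+: acquire() returns a double
        try rows.foreach { value =>
          limiter.acquire() // blocks until a permit is available
          producer.send(new ProducerRecord[String, String](topic, value))
        } finally producer.close()
      }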
[16:50:08] sigh...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 207) (an-worker1083.eqiad.wmnet executor 1): java.lang.NoSuchMethodError: com.google.common.util.concurrent.RateLimiter.acquire()D
[16:50:23] guava jar hell :(
[19:12:38] :(
[19:15:58] to be fair, I think we ran a rather old version of spark
[19:16:07] *run
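On the NoSuchMethodError above: the ()D descriptor means the calling code was compiled against Guava 16+, where RateLimiter.acquire() returns a double, while the class actually loaded at runtime is an older Guava whose acquire() returns void (often the copy bundled with older Spark/Hadoop deployments). The usual remedies are shading/relocating Guava inside the job jar or compiling against the cluster's Guava version. A small spark-shell sketch, assuming the usual spark session is in scope, to see which jar supplies RateLimiter on the driver and executors:

    import com.google.common.util.concurrent.RateLimiter

    // Where does Guava's RateLimiter come from on the driver classpath?
    println("driver:   " + classOf[RateLimiter].getProtectionDomain.getCodeSource.getLocation)

    // And on the executors? (getCodeSource can be null for bootstrap-loaded classes,
    // but Guava normally comes from a jar.)
    spark.sparkContext.parallelize(1 to 1)
      .map(_ => String.valueOf(classOf[RateLimiter].getProtectionDomain.getCodeSource.getLocation))
      .collect().distinct
      .foreach(loc => println("executor: " + loc))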