[07:20:29] looked at the commons report, 2x vs 10x, the diff is minimal and I'm not sure that 10x is necessary? not sure what "AA" stands for, seems more conservative than 2x
[07:21:35] and I suspect some diffs are not caused by the near match boost but rather by score discrepancies between shards
[07:24:01] I think the interesting metric is "Top 3 Unsorted Results Differ", showing that something new is appearing in the top-3, which I assume is likely a near match
[07:24:58] not sure that a near match hit switching from 3 to 2 is that important compared to a new near match hit now appearing in the top-3
[07:28:54] ah, is AA perhaps comparing the same config against itself, showing the randomness of the ranking? if yes it's a lot higher than I thought, 17.5% of the queries show a variation in the top-20 :(
[11:14:31] dcausse: indeed, AA is an A/A test, both sides the same. 10x isn't really a proposal, but more to show if there is "anything left", as in, did we miss anything. With 10x not going much further, probably not
[12:24:24] was looking at the CirrusSearch inconsistencies dashboard and it seems like we've improved over the years, most content models are now at <1000 problems, the only thing that did not improve is javascript & css, possibly something broken there
[14:01:27] now I'm going to be embarrassed to think about this for a week, and david finds the problem :P Should have been obvious to test starting from null, but the first step of my test script is to load some tags into the field and go from there...
[14:08:28] o/
[14:11:11] this problem haunted me quite a bit too tbh :P
[14:19:35] at least that doesn't sound too bad, I have a pretty good idea of where that could crop up
[14:19:50] I'm assuming it's not in the multilist handler itself, but the orchestration around it plausibly sees the null and skips the handler?
[14:21:26] hmm, but the set handler would also have the same problem if so... maybe not
[14:30:50] yes... would not be surprised if we see the "set" dialect in the redirect field :/
[14:30:52] looking
[14:32:52] hm... if I look at the dumps in hive I wonder if I'll find them in extra_fields, they could not possibly pass the table schema
[14:50:06] hm, looking at discolytics@import_cirrus_indexes, I understand that we would have failed if the type for the redirect field did not match the table schema, I guess we're lucky on this one
[14:50:54] you can run an explicit debug run to print problems, but without this debug param I suppose it would just fail
[15:01:54] hmm, would it just reject the set request? the set handler uses object notation instead of an array, which shouldn't be valid in the mapping
[15:02:21] hm.. I think opensearch does not care? testing
[15:06:17] yes, opensearch has no problem adding "redirect": {"add": [{...}]}
[15:10:54] I guess we're fine simply because cirrus always returns an empty array for the redirect field and somehow no docs end up with a null/unset redirect field
[15:27:32] yeah, that might be it
[16:11:14] So what causes it is the NullSafe wrapper. The SetHandler avoids the problem by not using the NullSafe wrapper
[16:33:31] hmm, spotless re-formatted a bunch of stuff :S
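For reference, a rough sketch of the 15:02–15:06 check above against a local OpenSearch instance. The index name (noop_test), port, and field values are made up for illustration, and the real CirrusSearch redirect mapping may differ; this only illustrates that the "set"-dialect object shape can slip into the index.

  # Normal shape: redirect is an array of redirect objects
  curl -s -XPUT 'http://localhost:9200/noop_test/_doc/1' -H 'Content-Type: application/json' \
    -d '{"redirect": [{"namespace": 0, "title": "Foo"}]}'
  # "set"-dialect shape: an object wrapping an "add" array; on an index with default
  # dynamic mapping this is accepted rather than rejected, consistent with the 15:06 observation
  curl -s -XPUT 'http://localhost:9200/noop_test/_doc/2' -H 'Content-Type: application/json' \
    -d '{"redirect": {"add": [{"namespace": 0, "title": "Bar"}]}}'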
[21:00:50] ryankemper: do you have anything for pairing? I'm working on T362114 but I'd just as soon skip
[21:00:50] T362114: OpenSearch on K8s: Create Dashboards - https://phabricator.wikimedia.org/T362114
[21:04:21] inflatador: I'm fine skipping. I pushed some changes to https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service#Manual_Steps yesterday so feel free to look them over; tldr I removed the step about manually creating the data_loaded flag in favor of using the --no-check-graph-type arg, and removed the bit about needing to manually start `load-dcatap-weekly` since that's been fixed in https://phabricator.wikimedia.org/T342162
[21:07:00] :eyes
[21:09:40] ryankemper: looks like we have a WDQS lag alert firing, hmm
[21:11:07] looks like lag is starting to creep up on CODFW again https://grafana.wikimedia.org/goto/7-QVQ2gDR?orgId=1
[21:11:15] looking as well
[21:11:38] I think we're getting slammed by something and experiencing intermittent / brief deadlocks
[21:11:44] I'll depool/restart 2011 since it seems the most lagged
[21:12:31] possibly a repeat of https://wikimedia.slack.com/archives/C01LE7VPU5A/p1761221970505619
[21:16:12] inflatador: yeah, let's try a rolling restart for now and see if things shake out
[21:17:10] ryankemper: looks like all hosts except 2008 are back down below 10m. I just depooled/restarted 2008, let's wait to see if that helped before restarting everything
[21:17:15] BTW we can see here what's going on: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs-main&from=2025-10-24T19:37:47.958Z&to=2025-10-24T21:16:21.622Z&timezone=utc&var-graph_type=%289102%7C919%5B35%5D%29&viewPanel=panel-7
[21:17:22] huge increase in triples being added to the db
[21:18:56] interesting, I guess it's hitting CODFW first
[21:19:01] to your point, wdqs2011's really been struggling
[21:19:40] and to yours, after I restarted 2, 2010 is now struggling. So it might just be whack-a-mole
[21:21:04] Yeah, I'll do a round of restarts in case that unsticks some things. For now though it's mostly an issue of the raw volume of updates
[21:21:17] also means it's good for the alert to be firing right now, because we actually want to be throttling bot requests
[21:21:19] +1, feel free to restart at will
[21:23:10] Kicked off the cookbook, but cancelled it because I realized it's going to sleep for a little too long
[21:23:14] will just use cumin instead
[21:23:39] based on the output of `bking@wdqs2010 cat access.log | egrep -v "(prometheus|Twisted)" | awk '{print $12}' | sort | uniq -c | sort -n`, it looks like bots pretending to be humans
[21:23:58] unless `Mozilla/5.0` is normally our top user-agent for WDQS
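A note on the 21:23 one-liner: awk '{print $12}' only keeps the first whitespace-separated token of the user agent, which is why everything collapses to "Mozilla/5.0". Assuming a combined-style access log where the user agent is the last double-quoted field (an assumption about this particular log format), something like this would group by the full string instead:

  # Group requests by the full quoted user-agent string (assumed to be the last "..." field)
  egrep -v "(prometheus|Twisted)" access.log | awk -F'"' '{print $(NF-1)}' | sort | uniq -c | sort -rn | head -20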
[21:28:30] Cluster looks a lot better right now
[21:28:35] We'll see if it holds :)
[21:29:01] Indeed. I posted some info in that Slack thread, feel free to look it over and add anything I missed
[21:33:18] The only thing I'm not sure of is, does this increase in triple count growth we're seeing also translate to a bunch of extra sparql queries
[21:33:45] Because we see a clear spike in triple count, cpu load, lag etc when the incident starts, but not so much on the sparql query rate side
[21:34:43] Although we did sort of see a spike in done rate and error rate
[21:34:59] -_(ツ)_/¯
[21:35:26] * ryankemper needs a better shrug
[21:35:27] ¯\_(ツ)_/¯
[21:48:26] inflatador: lag's back
[21:49:15] I'll see if I can identify the source, but it's somewhat likely that I'll just have to occasionally restart it across the next couple hours and hope the offender stops
[21:49:43] interestingly it is mostly codfw right now
[21:56:27] Triggered another rolling restart of codfw. Need to step out for ~45 mins but will check in on the cluster at that time. Provided there's not too much other noise, letting the MaxLag alert fire seems fine since it is an accurate description of the problem