[09:57:12] Errand, back in a few...
[10:36:30] gmodena: I'm late. Can we reschedule for this afternoon?
[10:36:46] gehel sure thing, no worries!
[11:46:24] lunch
[14:14:34] o/
[14:47:25] d-causse is out this week, right?
[14:51:56] inflatador yep
[14:53:44] inflatador I'm keeping an eye on SUP on his behalf, but holler if you spot something that looks off in flink/es metrics.
[14:54:36] today the systems seem well behaved
[14:56:41] gmodena cool, thanks. I was gonna ask him to take a look at some wdqs-categories stuff, but it's not urgent
[14:57:27] inflatador ah! that's not something i'm familiar with :(
[14:58:02] https://www.mediawiki.org/wiki/Wikidata_Query_Service/Categories if you're curious. Otherwise, no need to get too deep into the weeds ;)
[14:58:47] i'll have a look! thanks for the pointer
[15:04:08] FWIW mjolnir is doing fine too. It caught up with all pending training runs and is now on schedule.
[15:04:33] errand. back in 30 mins.
[16:59:05] dinner+kiddos
[17:00:03] i'll be back to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1120534 (stop the MLR a/b test) later this evening
[17:07:12] workout, back in ~40
[19:02:43] lunch, back in ~1h
[20:14:36] back
[20:15:28] looks like we got another couple of `CirrusSearchFullTextLatencyTooHigh` alerts that quickly resolved, checking 'em out
[20:52:19] inflatador seems like the pattern we've seen in the past week. There's a latency spike around 20:00 CEST, when traffic peaks
[20:53:18] i'm unable to correlate it to SUP / backfill activity that could strain ES
[20:54:14] disabling the A/B test should help, since it'll reduce the rps on the index (we stop interleaving queries), but I don't expect the latency spikes to go away entirely
[20:54:39] FWIW we've been seeing this since 3 hosts were decommissioned
[21:03:13] gmodena I think last time David and I talked about it, we tracked it down to a few badly-behaved hosts in one of the secondary clusters? Maybe psi? Does that sound familiar?
[21:03:31] inflatador it does!
[21:04:40] OK cool, looks like we got the alert yet again before I got a chance to check the dashboards ;( . checking now
[21:05:45] Our thought is that the sample size is too low for the secondary clusters and a few bad queries could affect the average
[21:13:43] inflatador what do you mean by sample size in this context?
[21:25:17] the PromQL `rate(mediawiki_CirrusSearch_request_time_seconds_bucket{type="full_text"}[5m])` ... theoretically, a few bad queries to a secondary cluster could slow things down enough to trigger an alarm. Subject to verification though
[21:28:11] I'm also having trouble figuring out why the per-node percentiles look OK, but the cluster-wide percentiles look bad
[21:29:07] inflatador ah! got it. A false positive could make sense
[21:31:20] yeah, although if we are getting a lot of pool counter rejections, it seems like there might actually be a problem https://grafana.wikimedia.org/goto/zUt4EecHg?orgId=1
[21:47:10] inflatador that could be an artifact of SUP + ongoing A/B test + weighted tags backfill
[21:48:25] SUP now has three replicas writing to the index (needed because of high backpressure/consumer lag)
[21:48:32] and this could hit overall performance
[21:49:16] inflatador i discussed some mitigation strategies with d-causse in case we hit a worst-case scenario, but hopefully we are not there yet?
[21:50:29] btw - i'm still in the backport window deployment queue. My turn should come soon :D
[21:50:52] gmodena ACK, do the backfills and/or A-B test times overlap with the pool counter rejection times on that graph?
[21:51:25] inflatador yes
[21:52:10] but i'm not yet familiar enough with all the moving parts to say that correlation implies causation :)
[21:53:07] gmodena ACK, same here ;)
[21:53:41] inflatador if things look fishy, wanna pair sometime during the week?
[21:54:33] For sure! I haven't done this kind of deep dive in a while. I'll get a ticket started
[21:55:48] gmodena I added you as optional to the weekly pairing I do with d-causse; if there is a better time for you, feel free to send me an invite
[21:57:13] inflatador thanks!
[22:41:07] inflatador ryankemper fyi: the mlr a/b test has been disabled
[22:41:14] metrics and logs look good
[22:42:24] gmodena ACK, thanks. I'm writing up T387176, and there is a ticket for enabling the performance governor for Elastic hosts in T386860 ... that really helped in the past for WDQS hosts
[22:42:25] T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176
[22:42:25] T386860: Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860
[22:43:07] inflatador nice!
[23:13:08] experiment logs have almost entirely faded from kafka, and latency metrics still look good. I'll give it 15 more mins and then I'm heading to bed :)
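
The "sample size" concern raised at 21:05 and 21:25 is that the secondary clusters may see so few full_text requests in the 5m alert window that a handful of slow queries can move the aggregate. A minimal PromQL sketch for checking that, assuming the histogram carries a per-cluster label (here hypothetically named `cluster`; the `_count` series name follows the standard Prometheus histogram convention and is not quoted in the log):

```promql
# Full-text requests per second per cluster over the 5m alert window.
# If a secondary cluster (e.g. psi) only handles a few requests in that
# window, a couple of slow queries can shift its latency percentile a lot.
sum by (cluster) (
  rate(mediawiki_CirrusSearch_request_time_seconds_count{type="full_text"}[5m])
)
```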
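On the 21:28 point (per-node percentiles look fine while cluster-wide ones don't): one way this discrepancy can arise with Prometheus histograms is that the cluster-wide quantile is computed from all nodes' buckets summed together, so it is weighted by request volume and can be pulled up by a busy or slow subset of hosts even when each individual node's quantile panel looks unremarkable. A hedged sketch of the two views, assuming a per-node `instance` label and a p95 target (both assumptions, not taken from the actual dashboards):

```promql
# Cluster-wide p95: buckets from all nodes are summed before taking the quantile.
histogram_quantile(0.95, sum by (le) (
  rate(mediawiki_CirrusSearch_request_time_seconds_bucket{type="full_text"}[5m])
))

# Per-node p95: keep the (assumed) instance label so each node gets its own quantile.
histogram_quantile(0.95, sum by (le, instance) (
  rate(mediawiki_CirrusSearch_request_time_seconds_bucket{type="full_text"}[5m])
))
```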