[08:44:39] Looking at https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&from=now-30d&to=now&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=14&viewPanel=21 , it seems that we have a minor performance degradation after the upgrade.
[08:45:38] It is more pronounced on the 75th or 95th percentile graphs. It seems minor enough that we probably want to ignore it. Any other opinion?
[09:00:50] we run active/active now so we're not comparing exactly the same setup
[09:02:02] and this graph is blending many different types of requests
[09:03:59] hm, but looking at fulltext queries, yes, seems like we have +10ms since the upgrade
[09:04:01] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&from=now-30d&to=now&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=14&viewPanel=44
[09:10:52] morelike took a big hit too
[09:16:20] I would expect active-active to give us better performance, not worse...
[09:16:51] load is now shared across more machines, and all requests are local to the DC
[09:19:38] true, but traffic might not be spread equally, e.g. codfw might be taking more of the smaller wikis, but I doubt that's the case
[09:25:14] time to get Oscar from school and I'll be off for the afternoon. I don't think I'll make it to the retrospective, but who knows...
[09:28:37] it's worse on all query types, +2ms on comp_suggest too :/
[10:18:45] lunch
[12:47:19] o/
[13:24:03] to be clear on the performance hit, this didn't manifest when we were on ES7 but only active in one DC, correct?
[13:25:40] inflatador: I'm comparing the latencies we see now vs the ones we had before we started the ES7 upgrade
[13:26:04] now we run active/active, so in theory eqiad should see less traffic than before
[13:27:07] in https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&from=now-30d&to=now&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=14
[13:27:56] if you open "Full text percentiles", the hole is when traffic was served in codfw, but the difference is visible before and after this hole
[13:28:16] fewer qps
[13:28:21] but higher latencies
[13:30:19] when did active/active go into production? Before we upgraded to ES7?
[13:30:48] roughly at the same time, sadly
[13:32:09] Damn. I wonder if we can temporarily disable active/active. I doubt it's the problem though
[13:32:25] strangely we see latencies slightly increasing a couple of days before all traffic went to elastic@codfw
[13:36:10] morelike cache hit rate dropped from 76% to 43% though
[13:37:16] around 2022-08-31
[14:03:06] just restarted the streaming updater for wcqs in k8s, will do the same for wdqs now
[14:05:36] dcausse: cool, sorry I didn't catch that
[14:05:46] inflatador: btw you removed the runbook from https://wikitech.wikimedia.org/w/index.php?title=Wikidata_Query_Service/Flink_On_Kubernetes&diff=2006399&oldid=2006398 but I think you forgot to re-add it to https://wikitech.wikimedia.org/w/index.php?title=Wikidata_Query_Service/Streaming_Updater :)
[14:06:19] it's a nice piece of doc, we should def keep it I think :)
[14:06:32] you're right; I did forget... oops!
[14:07:16] also, writing a couple of lines on how to clean up swift would be very valuable I think :)
[14:09:52] oh yeah, good call. Still at the off-site, but should be able to finish today.
[14:10:17] oh sure, no rush! :)
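For the swift cleanup notes mentioned above, a minimal sketch of what such a helper could look like, assuming python-swiftclient. The container name, object prefix, retention period, and credentials are placeholders for illustration only, not real production values, and this is not an existing script:

```python
# Hypothetical sketch: delete Swift objects older than N days under a prefix.
# All names and thresholds below are placeholders, not production config.

from datetime import datetime, timedelta, timezone
from swiftclient.client import Connection

RETENTION = timedelta(days=30)          # assumption: keep the last 30 days
CONTAINER = "rdf-streaming-updater"     # placeholder container name
PREFIX = "checkpoints/"                 # placeholder object prefix


def cleanup(conn: Connection) -> None:
    cutoff = datetime.now(timezone.utc) - RETENTION
    _, objects = conn.get_container(CONTAINER, prefix=PREFIX, full_listing=True)
    for obj in objects:
        # Swift reports last_modified as an ISO-8601 string in UTC, without a timezone suffix.
        modified = datetime.fromisoformat(obj["last_modified"]).replace(tzinfo=timezone.utc)
        if modified < cutoff:
            conn.delete_object(CONTAINER, obj["name"])


if __name__ == "__main__":
    # Auth URL and credentials are placeholders; the real account settings would come from config.
    cleanup(Connection(authurl="https://swift.example.org/auth/v1.0",
                       user="account:user", key="secret"))
```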
[14:18:35] looks like the k8s alert's back, there's also an ML k8s alert in operations... any idea if they use the same k8s cluster?
[14:18:59] inflatador: they should be on a separate cluster
[14:19:18] the alerts should resolve themselves soon I think
[14:19:44] RdfStreamingUpdaterFlinkJobUnstable is a bit fragile when the job starts
[14:22:05] ACK
[14:25:41] inflatador: when you have a couple of minutes, I pushed a quick&dirty alert on space usage (https://gerrit.wikimedia.org/r/c/operations/alerts/+/834008). I think we want more granularity here, but it's safer to have something less accurate sooner rather than later (goal is to be warned quickly if something crazy starts happening again)
[15:15:52] inflatador: retro?
[15:15:54] inflatador: retro https://meet.google.com/eki-rafx-cxi?authuser=2
[16:07:55] mpham: i'm not sure what else to put in that ticket, https://phabricator.wikimedia.org/T306899#8207172 is my confirmation that the deployed fixes look to have resolved the issues we were able to reproduce. I suppose the additional patches after that might make things less clear, but those weren't about the 500, rather they were about being able to write sane usage docs for bots (i guess i
[16:07:57] put the patch on the wrong ticket)
[16:15:07] ryankemper, inflatador: pretty long day already. I'll skip our pairing session for today.
[16:16:42] gehel: ACK, get some rest
[16:30:07] looking at the ApiFeatureUsage code, I wonder if it's not sending its query to all indices
[16:31:28] and that previously only the filter on _type:api-feature-usage-sanitized was actually helping
[16:32:37] dcausse: huh, indeed $indexes gets populated but then it doesn't do anything with it
[16:33:24] going to add a call to addIndices but wondering if we won't hit a limit there
[16:33:38] dcausse: wondering about URL length?
[16:33:42] yes
[16:33:59] it's rarely used, we could probably use a *
[16:34:23] sure
[16:34:51] inefficient, api-feature-usaapifeatureusage-*
[16:35:04] blah, typing and not finishing :P but apifeatureusage-* is probably fine
[16:35:37] :)
[16:35:40] ok, doing this
[16:36:51] i wonder what kibana does. Poking at my network tab, when i query 12 days it issues the index as "logstash-*", but they could have some intermediary bit
[16:37:13] i guess it's not kibana anymore, "opensearch dashboards"
[16:39:10] and i suppose no one really cares, but it's slightly annoying that it calls https://logstash.wikimedia.org/internal/search/opensearch which doesn't implement the opensearch standard :P
[16:39:20] inflatador, ryankemper: could you use today's pairing session to have a look at the backlog and see if there is something we need to be moving forward during next quarter?
[16:41:18] gehel: ACK, will do
[16:58:52] I'm a bit late, but congrats on getting the Elasticsearch upgrade complete!
[16:59:15] tltaylor: thanks! :)
[17:27:07] ebernhardson: re: our rate-limiting discussion yesterday, looks like the hiera variable for public clouds is 'public_cloud_nets' if we want to use that anywhere
[17:44:00] lunch, back in time for SRE pairing
[17:49:35] dinner
[18:19:10] back
[18:26:43] ryankemper will be ~5m late to SRE pairing
[18:26:59] inflatador: ack
[18:34:38] back
[19:05:39] hmm, so looking through the related VCL for public_cloud_nets it seems the X-Analytics header should have public_cloud=1 as one of its ;-delimited values. I'm not 100% sure if X-Analytics is delivered to the mediawiki app servers, but seems plausible... more looking
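A minimal sketch of the public-cloud check being discussed here, assuming the X-Analytics header does reach the application layer as a ;-delimited key=value string (the public_cloud=1 convention comes from the VCL discussion above; the function name and example values are hypothetical, not existing code):

```python
# Sketch: detect requests flagged as coming from public cloud ranges via the
# X-Analytics header, which Varnish populates with ;-delimited key=value pairs.
# Assumes the header survives to the application; not an existing helper.

from typing import Optional


def is_public_cloud(x_analytics: Optional[str]) -> bool:
    """Return True if the X-Analytics header contains public_cloud=1."""
    if not x_analytics:
        return False
    pairs = {}
    for field in x_analytics.split(";"):
        key, _, value = field.strip().partition("=")
        if key:
            pairs[key] = value
    return pairs.get("public_cloud") == "1"


if __name__ == "__main__":
    # Example header values are illustrative only.
    assert is_public_cloud("ns=0;public_cloud=1")
    assert not is_public_cloud("ns=0")
    assert not is_public_cloud(None)
```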
[19:09:32] looks like whatever was hammering codfw finished, it ran 14:00-22:00 yesterday
[20:19:47] hmm, not having great ideas... the mjolnir msearch daemon is currently not doing anything because in an active-active situation it pauses consumers in both DCs (we knew this, but didn't have great ideas earlier either :P)
[20:21:18] perhaps we can have it ask elasticsearch about load via the node stats API and set up a per-record delay (throttling) that increases whenever node stats reports a node above some threshold
[20:24:34] but we still have the complication of how to know which cluster it should query, a simple threshold like we have now seems too naive, at the low point of a day eqiad sees ~1.9k enwiki_content full_text shard qps, and codfw peaks just over 2.2k
[20:28:48] i suppose the whole daemon is a bit more pointless these days since yarn<->elasticsearch is no longer firewalled. But then we still need the load-based throttling, just on the other end of the system
[20:33:10] it would be cleaner if we could have some sort of middleware on the elastic side that could issue 429's and be configured by the request
[20:41:07] maybe i can reuse the DegradedRouterQueryBuilder i wrote somehow... it's able to choose different queries based on server load... will have to test, maybe i can turn that into an EsRejectedExecutionException instead of running a degraded query, and throttle based on that... hmm
[20:41:18] more thought necessary :P
[20:44:00] ebernhardson: sorry, ryankemper and I've been working on this epic motd patch ;)
[21:07:40] lol, no worries, i'm talking to myself mostly anyways :P
[22:06:30] ahh, the tried and true way to reclaim memory from chrome... kill -9 and restore without tabbing into all the extra tabs :P
[22:29:20] cool, it works. Throwing an EsRejectedExectionException during query rewrite sends a 429 back to the client
[22:29:33] * ebernhardson should learn to spell some day
[22:30:32] but now i have to decide what it should look like, the impl i have now is awkward but "works"
[22:30:51] maybe awkward is fine
[23:20:36] "maybe awkward is fine" - I feel like that is how a lot of us get through life... ¯\_(ツ)_/¯
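The load-based throttling idea sketched from [20:21:18] onward (poll the node stats API and slow the msearch consumer whenever a node looks hot) could look roughly like this in Python. It is a sketch only: the CPU threshold, delay cap, and consume() loop are invented for illustration and are not how the real mjolnir daemon is written; only the node-stats and msearch calls are standard elasticsearch-py API.

```python
# Rough sketch of a per-record delay driven by the node stats API.
# Thresholds and the consumer loop are illustrative assumptions.

import time
from elasticsearch import Elasticsearch

CPU_THRESHOLD = 70   # assumption: back off once any node is above 70% CPU
MAX_DELAY = 5.0      # seconds; cap the delay so the consumer never fully stalls


def current_delay(es: Elasticsearch) -> float:
    """Return a per-record delay based on the busiest node's CPU usage."""
    stats = es.nodes.stats(metric="os")
    busiest = max(node["os"]["cpu"]["percent"] for node in stats["nodes"].values())
    if busiest <= CPU_THRESHOLD:
        return 0.0
    # Scale the delay linearly between the threshold and 100% CPU.
    return MAX_DELAY * (busiest - CPU_THRESHOLD) / (100 - CPU_THRESHOLD)


def consume(es: Elasticsearch, records) -> None:
    """Hypothetical consumer loop: each record is an already-prepared msearch body."""
    for record in records:
        time.sleep(current_delay(es))
        es.msearch(body=record)
```

This still leaves open the "which cluster should it query" question raised at [20:24:34], and the server-side 429 via query rewrite discussed later may end up being the cleaner interface, with the client simply backing off whenever it receives a 429.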