[09:58:58] lunch
[12:42:37] o/
[13:55:04] inflatador: are we ok for the switch upgrade in E2?
[13:55:44] topranks checking
[13:56:15] it's elastic1091 & elastic1092, plus wdqs1018 and wdqs1020
[13:57:51] topranks OK, we're ready. Sorry for not getting to that sooner
[13:58:01] no probs at all thanks!
[14:23:56] upgrade is done if you want to repool
[15:00:33] office hours on https://meet.google.com/vgj-bbeb-uyi
[15:00:57] dcausse, pfischer, ebernhardson, dr0ptp4kt ^
[16:04:48] picking up my cat, back in ~20
[16:08:09] Trey314159, pfischer uploaded https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1051798 if you have time please feel free to +1 if you don't see any issues and I'll try to ship this tomorrow
[16:08:25] dinner
[16:20:02] back
[18:12:20] lunch, back in ~45
[18:50:35] hi, are there any Extension:CirrusSearch understanders in the chat? :)
[18:51:14] dcausse is probably already away, ebernhardson is on vacation. Maybe dr0ptp4kt knows enough.
[18:51:21] Or very unlikely me...
[18:51:31] * inflatador is also in the "unlikely" camp
[18:52:02] pfischer might know some as well, but should also not be around at this time.
[18:52:24] I just noticed, there are a large number of request traces that look like this one: https://trace.wikimedia.org/trace/385340f9223ae3ef94a18ec5c5f3f336
[18:52:52] a user call to Special:Search with a query, then five parallel calls back to the MediaWiki API for the same cirrus-config-dump
[18:53:05] I can't imagine that's intentional behavior
[18:53:53] does this seem like a new issue, or has it been going on for a while?
[18:54:02] very likely going on for a while
[18:54:07] (at a guess)
[18:54:16] the new thing here is having distributed tracing :)
[18:54:24] I know, it's awesome~
[18:54:35] interesting... but this will need to wait until tomorrow
[18:54:39] no worries, I can file a task
[18:54:48] ACK, was gonna offer but you have the most context
[18:54:52] please do! and tag Discovery-Search
[18:54:58] I'm quite confident they are separate requests though -- you can dig into each one and see that it was routed to a different mw-api-int instance
[18:55:01] ok will do :)
[18:55:33] oh, I also owe you a task about Extension:CirrusSearch not propagating the headers we need to get its queries to elastic included in the traces
[18:55:34] cdanis: are requests parallelized?
[18:55:59] gehel: according to the trace, these are, yes
[18:56:23] yeah, that's what I'm seeing on the trace, but that surprises me even more
[18:56:27] right?
[18:56:58] * gehel is looking forward to having elastic included in the distributed traces
[18:57:55] right now we can see traffic towards it, but it's not connected to any parent requests: https://trace.wikimedia.org/search?service=search-omega-eqiad
[18:58:01] is the distributed tracing something we could extend to more systems? I'm specifically thinking Blazegraph. With internal federation, that might become useful
[18:58:19] gehel: yes, assuming the systems in question support propagating the usual opentelemetry headers
[18:58:48] anything that propagates `traceparent` and `tracestate` should 'just work' when running with our mesh on k8s
[18:59:02] there's some work for bare-metal I haven't done yet, but the pieces are there and it's "just" a matter of putting them together
[18:59:35] blazegraph is definitely in the baremetal category. Adding opentelemetry to it should not be too complicated (but some work)
[19:00:02] (to be clear, by propagate I mean "copies the incoming request header to any outgoing request headers made for service calls 'on behalf' of that incoming request")
[19:00:10] (like we generally already did with x-request-id)
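For context on what "propagate" means above, here is a minimal sketch of copying the incoming trace-context headers onto outgoing calls made on behalf of a request. This is purely illustrative: it assumes a Flask-style Python service and the requests library, and the handler and downstream URL are placeholders, not actual MediaWiki, CirrusSearch, or mesh code.

```python
# Minimal sketch (not production code): forward W3C trace-context headers
# from an incoming HTTP request to outgoing service calls made on its behalf.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Headers copied verbatim so the tracing backend can stitch the outgoing
# call into the same trace as the incoming request.
TRACE_HEADERS = ("traceparent", "tracestate", "x-request-id")


def propagated_headers(incoming_headers):
    """Return only the trace-context headers present on the incoming request."""
    return {
        name: incoming_headers[name]
        for name in TRACE_HEADERS
        if name in incoming_headers
    }


@app.route("/search")
def search():
    # Hypothetical downstream call "on behalf of" the incoming request;
    # the URL and action parameter are placeholders for illustration.
    resp = requests.get(
        "http://mw-api-int.example/w/api.php",
        params={"action": "cirrus-config-dump"},
        headers=propagated_headers(request.headers),
        timeout=5,
    )
    return jsonify(resp.json())
```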
[19:00:39] okay, I need a snack now but I'll file tasks before this all swaps out of my brain
[19:00:45] thanks for the quick look!
[19:01:23] actually, now that I think about it, it's going to be a mess for Blazegraph. Too many thread pools, so no way to easily match incoming requests to outgoing requests without being quite invasive
[19:01:53] gehel: surely there's a per-unit-of-work piece of context somewhere being passed from pool to pool?
[19:03:27] yes, of course, but that's blazegraph internals, so it would need some deep changes. In the more common case of a single thread, there is a thread context that can be used, independently of whatever the application is doing. So it becomes very easy to add that kind of behaviour, without touching application code.
[19:05:31] can you shape the top-level federated queries such that you inject a sparql comment with some machine-readable tags in each of the subqueries?
[19:06:33] some prior art: https://google.github.io/sqlcommenter/
[19:06:57] anyway, actually going AFK for a bit now :)
[19:37:40] these are requests made to fetch the config of sister wikis to perform interwiki searches, they should be heavily cached... 200ms to fetch the mw-config does not seem right tho :(
[19:52:50] hm... this should use the wan object cache but I'm not finding the cache key group in https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1 ...
[19:54:38] ah maybe not, it might be using the local server cache but the code is a bit messy so that might be a mistake
[20:11:33] workout, back in ~40
[20:52:31] back
[21:18:35] g.ehel and c.danis, apologies - i was taking a certification exam (🤞). i'm winding down shortly, for holiday and vacation. wishing you and everyone here well.
[21:18:58] dr0ptp4kt wishing the best for you and your CKA status!
[21:19:55] thanks inflatador - i'm suspecting borderline. i had one process i was proficient enough in, but i misread part of the instruction, and so burned about 10 minutes more than i wanted to on that. caveat emptor! talk to you later, be well!
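As a footnote to the sqlcommenter-style idea floated at 19:05 (tagging each federated subquery with a machine-readable comment so Blazegraph-side query logs can be matched back to a trace), a rough sketch under stated assumptions: the tag names and helper function are hypothetical, the example values are illustrative, and the only real guarantee relied on is that SPARQL ignores `#` comment lines.

```python
# Illustrative sketch of sqlcommenter-style tagging for SPARQL subqueries:
# prepend a comment line carrying trace context so downstream query logs can
# be correlated with the originating request. Tag names are hypothetical.
from urllib.parse import quote


def tag_sparql_query(query: str, traceparent: str, request_id: str) -> str:
    """Prepend a machine-readable comment that the SPARQL engine ignores."""
    tags = {
        "traceparent": traceparent,
        "x_request_id": request_id,
    }
    comment = "# " + ",".join(
        f"{key}='{quote(value)}'" for key, value in sorted(tags.items())
    )
    return f"{comment}\n{query}"


if __name__ == "__main__":
    # Example subquery and example trace identifiers for illustration only.
    subquery = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10"
    print(tag_sparql_query(
        subquery,
        traceparent="00-385340f9223ae3ef94a18ec5c5f3f336-00f067aa0ba902b7-01",
        request_id="example-request-id",
    ))
```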