[07:39:09] Oscar is sick and staying home, my availability will be reduced today.
[07:46:17] hope he gets better soon, take care!
[08:43:11] I wish him a fast recovery!
[08:51:55] dcausse: Would you have a moment to review https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/14, please? I extracted common.config.CommonConfig as part of this, which would affect two more PRs from Erik. 🙊
[08:54:38] pfischer: sure
[08:58:13] errand, back in a few
[09:05:51] pfischer: not sure I understand why some tests have changed? e.g. https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/14/diffs#diff-content-4252edc05e199d685628fb38e0acd91e4fdd2015
[09:07:08] perhaps to use java9 features, e.g. ImmutableMap.of -> List.of
[09:08:12] dcausse: MediawikiPageContentChangeEnrichJobManagerNotRunning, does this ring a bell?
[09:09:05] gehel: this is the flink job for the page enrichment, Gabriele should know what to do
[09:09:10] that seems to be on the DE side
[09:09:40] seems like mostly test methods that were moved to different places
[09:15:36] dcausse: thanks! I'm talking about it with Gabriele
[09:16:20] dcausse: The new feature implemented in this PR is HTTP routing with pattern matching (hence the changed tests). Since this is needed for both producer and consumer, I decided to extract CommonConfig. I might have overshot by renaming config properties (another cause for changed tests).
[09:17:34] dcausse: let me check why it reordered the methods…
[09:17:40] yes, just saw that the signature changed from Map to List>, but it seems like methods were shuffled a bit in the file
[09:18:08] no worries tho, just a bit harder to review
[09:26:19] dcausse: I removed the superfluous changes.
[09:26:27] thanks!
[09:57:00] til lombok SuperBuilder (thanks to Peter), very powerful but slightly confusing
[09:59:15] Yeah, agreed. I'm not perfectly happy with it either. But I like the immutable config that comes out of it. 🤷
[10:42:33] lunch
[13:14:48] o/
[14:14:51] hi, since 14:02 roughly, we get a bunch of `ApiUsageException: Search is currently too busy. Please try again later.` errors
[14:15:24] all from commonswiki (I got hit by it while doing a Special:MediaSearch)
[14:16:25] dcausse, inflatador: any chance you could have a look at the increased Cirrus latency?
[14:16:43] sure
[14:26:44] that's exactly following the switchover btw
[14:26:52] (we flagged it on -sre)
[14:29:39] claime: we run only codfw, right?
[14:31:49] yeah
[14:32:43] ok, we suspect that the cirrus query cache has to warm up
[14:32:59] claime: are things looking better? We're looking at the cirrus query cache stuff
[14:34:35] yeah, it's trending back down to baseline
[14:34:50] our previous assumption was that since codfw was already receiving traffic before the switch, the caches should already be warm enough. Sounds like we were wrong?
[14:35:48] yes, seems like the cache is not the same
[14:36:17] that's an interesting learning in itself
[14:36:17] traffic patterns are perhaps very different between eqiad & codfw?
[14:36:19] claime: do you know off the top of your head if the warmup script also warms up some search queries, or just wiki pages?
[14:36:24] I guess because there's usually much more traffic in eqiad? But that's just a guess
[14:36:31] We didn't run the warmup script
[14:36:37] yes I know
[14:36:40] that's why I'm asking
[14:36:41] :D
[14:36:41] Ah
[14:36:48] Err, not off the top of my head, no
[14:36:52] if it would have helped or not in this situation
[14:37:01] or we might need a dedicated warmup step for search
[14:37:51] we also thought that the cirrus query cache was replicated between the two DCs, but apparently not
[14:37:55] I vaguely remember we had a dedicated warmup procedure for Search. But that might have been directing a higher percentage of Search traffic to codfw before the switch. dcausse or ebernhardson probably know better than I do
[14:42:18] banning one heavily loaded node did help a lot tho
[14:42:40] could be a poor distribution of the shards that was not detected while we had low traffic on codfw
[14:43:29] we previously kept more_like queries on eqiad until the codfw cluster had time to warm up: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&oldid=2056048#Preserving_more_like_query_cache_performance
[14:45:59] looks like 30 minutes of degraded operations. Not good for a planned switchover, but that gives me confidence that in case of an unplanned one, things are going to be mostly ok
[14:47:22] in retrospect I see a few short-lived alerts from earlier this week for `CirrusSearch eqiad 95th percentile latency`
[16:01:49] Does anyone have strong feelings on whether or not the search issues should be an incident? I'm thinking yes
[16:03:49] It had a pretty visible user impact, so that sounds like something that deserves communication. We also have work to do so that this does not happen during the next DC switch.
[16:04:09] That sounds like it makes sense to have an incident doc
[16:04:27] gehel: ACK, taking a break but will start when I get back
[16:41:51] back
[17:22:57] going offline
[17:32:46] thanks for your help d-causse!
[17:33:15] https://wikitech.wikimedia.org/wiki/Incidents/2023-09-20_Elasticsearch_unavailable incident report here, feel free to add/change anything
[17:42:29] lunch/doctor appointment, back in ~3h
[20:23:02] back
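For reference, the Lombok @SuperBuilder pattern mentioned at 09:57 looks roughly like the sketch below: a builder that works across an inheritance hierarchy and yields immutable objects, similar in spirit to the extracted CommonConfig. The class names, fields, and values are illustrative assumptions for this sketch and are not taken from the cirrus-streaming-updater repository.

```java
// Illustrative only: an immutable config hierarchy built with @SuperBuilder.
// All names and fields here are invented for the example.
import lombok.Getter;
import lombok.experimental.SuperBuilder;

@Getter
@SuperBuilder
class CommonConfig {
    // final fields and no setters: instances are immutable once built
    private final String apiUrl;
    private final int requestTimeoutMs;
}

@Getter
@SuperBuilder
class ConsumerConfig extends CommonConfig {
    private final int fetchBatchSize;
}

class SuperBuilderDemo {
    public static void main(String[] args) {
        // The child's builder also exposes the parent's fields, which a
        // plain @Builder cannot do across an inheritance chain.
        ConsumerConfig config = ConsumerConfig.builder()
                .apiUrl("https://example.org/w/api.php")
                .requestTimeoutMs(5_000)
                .fetchBatchSize(100)
                .build();
        System.out.println(config.getApiUrl());
    }
}
```

The "slightly confusing" part is usually the self-referential generic builder types Lombok generates behind the scenes; call sites only see the fluent API shown above.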