[09:50:25] ebernhardson: I think that T335499 has been fully deployed. Am I missing something?
[09:50:26] T335499: Ensure that we collect appropriate data for Search platform SLIs - https://phabricator.wikimedia.org/T335499
[09:58:27] weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-09-15
[10:01:26] lunch
[12:12:49] dcausse: I just noticed this branch: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/compare/main...test-mem-opt-runner - should there be a merge request for it?
[12:14:44] pfischer: I'm still testing and need to ask for some guidance from releng; they made the mem-opt runners a bit more stable, but they're still a bit slower (+1 min) compared to wmcs, so I'm not sure we should go for it yet
[12:18:42] Ah, sure. Thank you for looking into this.
[12:25:57] dcausse: Would you have time to discuss CirrusDocEndpoint (in ~30 min)? I'm a bit torn whether the distinction still makes sense in light of the latest decision to always request links (for rev and rerender).
[12:26:26] pfischer: sure
[13:02:50] dcausse: https://meet.google.com/jyu-qjua-pmk?authuser=0
[13:30:45] o/
[13:54:02] dcausse no hurry, but we have a lot of wdqs hosts in CODFW; if you want to use some of them to work on the graph splitting stuff, LMK
[13:55:34] inflatador: thanks! we'll certainly need a couple of hosts (3 I think), but I guess for several months, perhaps a year
[13:56:26] ACK. I'll comment on the epic re: available hosts
[14:29:51] dcausse I just merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957311 , I'm up at https://meet.google.com/aod-fbxz-joy to do the operator deploy.
[14:30:02] sure
[14:32:34] inflatador: realized i'm an idiot: the mjolnir package on gitlab ships its own python interpreter. It shouldn't need the system python
[15:21:39] ebernhardson interesting. I guess we need to rip out some puppet code so it doesn't try to install old python
[15:22:08] also, we got the flink-ZK stuff working!
[15:22:25] inflatador: nice!
[15:22:51] thanks for your help on that
[15:23:00] np
[15:23:36] I totally lack context, but AFAIK we shouldn't run non-OS-provided python; if that's in prod please check with mor.itzm for an authoritative answer ;)
[15:25:20] volans: hmm, this is the default deployment method for python on the analytics side; i suppose it might be a bit unique that we reuse the same package on the prod side
[15:25:54] which python do you need?
[15:26:09] this is on 3.7, which has been EOL since july
[15:26:23] i mean, probably we should update it, but i was trying to avoid blocking the task to update the os on other things
[15:27:40] sorry, I'm confused: do you need 3.7 on a more recent OS, or do you have 3.7 and want something more recent?
[15:27:58] 3.7 on a more recent OS is the immediate need, so we can get rid of Buster
[15:28:16] we are currently on an old debian, with a task to update to the new debian. The software runs on 3.7
[15:28:42] long-term we do need our app to run on non-EOL python, see https://phabricator.wikimedia.org/T346373
[15:28:58] ah ok
[15:29:27] so faidon's backports are not useful here, you need the opposite
[15:30:58] FWIW I think it's better to focus on getting the app updated, as we're not the only ones still using Buster ;)
[15:32:12] can look into it; the main problem there is the ancient version of conda. The only thing i've gotten to work so far is to dump the dependency list from the new conda and feed it into the old conda. Otherwise the dep resolver gets stuck
[15:32:18] at this point you could consider skipping buster too :D
[15:33:13] Bullseye? Y, I was thinking about that too
[15:33:22] We're also considering migrating to k8s
[15:33:39] given the constraints and my lack of context I don't know what the best course of action is, sorry for the distraction
[15:33:58] np, you actually brought some focus on this
[15:34:05] s/buster/bullseye/ I meant go directly to bookworm
[15:34:30] Is there something that would explain a sudden x20 (yes) increase in requests from the WDQS updater bot to mw-api-int?
[15:34:53] https://logstash.wikimedia.org/goto/a0edc67230876f3ae05578c482237626
[15:35:05] https://grafana.wikimedia.org/goto/9iYqoTmIk?orgId=1
[15:35:29] claime I hope not? we just recently did a test app deploy in dse-k8s. dcausse any thoughts on this?
[15:36:02] let me check superset too, it could just be someone sending ridic queries
[15:36:37] I'll temporarily scale up
[15:36:59] lost state on the updater and it's trying to catch up as fast as it can?
[15:37:36] claime: we resumed the test we have in dse-k8s and it's backfilling
[15:37:49] happened earlier this week too (https://logstash.wikimedia.org/app/dashboards#/view/018bde90-a08d-11ed-8137-c3b9b9c0225e?_g=h@463f696&_a=h@eee302c)
[15:38:04] Well, I need to add a bit of capacity then
[15:38:59] we can lower the concurrency on our side too
[15:40:12] I'll see if I can fit two more replicas on my side and whether that's enough
[15:40:36] thanks!
[15:41:32] we should maybe find a way to load balance this across DCs if it's not writes
[15:42:50] And probably just 2 more replicas won't be enough, so lowering concurrency might still be necessary
[15:42:58] sure
[15:46:31] I'll throw two more at it
[15:47:22] dcausse is it just `parallelism` that we change in the flink conf?
[15:47:38] inflatador: no, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957955/1/helmfile.d/dse-k8s-services/rdf-streaming-updater/values.yaml
[15:48:06] * inflatador never would've guessed that
[15:49:20] anyway, I'll get that merged and redeployed
[15:50:17] thanks!
[15:54:26] claime just deployed on our side, let us know if you're still seeing high RPS
[15:56:04] It's still elevated (~600), which is half what it was at peak, so combined with the replica increase, latency is back under control and so is congestion
[15:57:54] We should discuss it when I'm not about to leave for the weekend, but is that something we're expecting to happen regularly? Because we are going to move more and more production to use mw-api-int as a backend, so we're going to have to be more careful around huge spikes like this until we have integrated all the new hardware and can scale properly
[15:58:02] I was just thinking the same thing
[15:58:36] This would happen every time we redeploy or have a pod failure
[15:58:44] dcausse ^^ feel free to QC me on this
[15:59:44] such backfills are super rare
[16:00:01] I'm not sure why this new approach would be more intensive than the old one
[16:00:15] it happens here because we made a completely new savepoint last week
[16:00:28] and it has to process 2+ weeks of updates
[16:00:42] we're never (rarely) in this situation
[16:01:26] OK, then I guess we'd have to be offline for more than a week before this came up again
[16:02:28] ok, I got very scared by "This would happen every time we redeploy or have a pod failure"
[16:02:33] Unless I goof something up really badly, that probably won't happen. If it does, we know to give a warning to ServiceOps.
[16:02:45] Will update our docs accordingly
[16:03:38] I'd have expected the concurrency decrease to be more effective... was expecting a /6 reduction, will take a closer look
[16:08:26] Can you make a task with the results of your investigation, dcausse, so we can keep the rest of ServiceOps in the loop?
[16:08:35] sure
[16:12:05] ty :)
[16:45:00] does anyone know how to remove MRs that are in "your turn" from the gerrit dashboard? I've got a few that have post-merge comments that aren't actionable
[16:52:42] "ack" the comments?
[16:53:58] on the change you can remove yourself from the attention set if you want
[16:54:33] click on the ">" preceding your name
[17:25:35] nice! Thanks d-causse
[18:00:46] started on the path naming patch. I can't remember what we used last time, but I don't have strong feelings on the name https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957967/
[19:02:17] * ebernhardson finds it randomly curious that the flink configuration parsing uses Integer.class.equals(clazz) to check int/float/bool/etc, but clazz == Map.class for non-basic types. I wonder why they allow extended implementations of the basic types but not others
[20:36:27] late lunch, back in ~30
[21:04:55] back
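Regarding the [19:02:17] observation: since java.lang.Class does not override equals(), `Integer.class.equals(clazz)` and `clazz == Map.class` are both effectively reference comparisons and behave the same way; neither one accepts subtypes. Below is a minimal, hypothetical Java sketch (not Flink's actual code) of that pattern, with an added isAssignableFrom() branch showing the check that genuinely would accept extended implementations such as HashMap.

```java
import java.util.HashMap;
import java.util.Map;

public class ClassTokenCheck {
    // Hypothetical dispatcher on a Class token, mirroring the two styles mentioned above.
    static String describe(Class<?> clazz) {
        if (Integer.class.equals(clazz)) {            // same effect as clazz == Integer.class
            return "integer";
        } else if (clazz == Map.class) {              // exact match only: HashMap.class != Map.class
            return "map";
        } else if (Map.class.isAssignableFrom(clazz)) {
            return "some Map implementation";         // this is the check that accepts subtypes
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe(Integer.class));  // integer
        System.out.println(describe(Map.class));      // map
        System.out.println(describe(HashMap.class));  // some Map implementation
    }
}
```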