[08:24:07] Last night, SUP was killed by a 503 from ES https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.15?id=PV7GxI4B8T_a4T-erTul - Looking at the cloudelastic grafana dashboard
[08:24:07] https://grafana.wikimedia.org/d/000000460/elasticsearch-node-comparison?orgId=1&var-cluster=cloudelastic&var-exported_cluster=cloudelastic-chi&var-dcA=eqiad%20prometheus%2Fops&var-nodeA=cloudelastic1005&var-dcB=eqiad%20prometheus%2Fops&var-nodeB=cloudelastic1005&from=now-12h&to=now it appears as if it has been running without interruptions. Is this something to be expected (and retried)?
[08:27:24] pfischer: yes, the SUP should definitely survive when these 503s happen; what's not normal and should be looked into is why the job did not restart
[08:29:15] not the first time we see it giving up on restarts too early, the restart strategy seems a bit weak imo and might be fine-tuned I think
[08:51:36] I’ll add that to the retryable exceptions to avoid the restart overhead
[08:54:11] we use FixedDelayRestartBackoffTimeStrategy which is given 10 restarts for the whole life of the job
[08:55:07] the restart counter never seems to be reset
[08:57:44] Exponential Delay Restart Strategy might be more appropriate for us perhaps
[08:59:26] Was there something changed in flink-app-consumer-cloudelastic? The volume of requests it makes to mw-api-int has between doubled and quadrupled since the beginning of April
[08:59:36] https://grafana.wikimedia.org/goto/SOgR5VaIg?orgId=1
[09:06:48] claime: yes, pfischer has been working on increasing its throughput in recent days
[09:08:22] and because it's still a bit unstable it often has to catch up, pushing the pipeline to max throughput
[09:08:27] Hmm ok, that means we need to prioritize scaling mw-api-ing up a bit, because it's pushing us over the threshold for worker saturation
[09:09:03] s/-ing/-int/
[09:10:19] claime: it's probably a matter of agreeing on a budget for us if that is too much for mw-api-int?
[09:15:05] I think if you keep to ~1krps max for now we can handle that
[09:15:30] pfischer: ^
[09:15:44] We may be able to accommodate more than that once we've moved more appservers and can jiggle things a little better between the different deployments
[09:16:15] (by moved more appservers, I mean grown the wikikube k8s cluster with repurposed appservers)
[09:16:54] sure, we should soon be moving a lot of requests off of jobrunners (using this new pipeline via mw-api-int)
[09:17:36] claime: sure, I’ll rate-limit our fetcher
[09:17:38] yeah, which means we would be able to scale jobrunners down a bit, and mw-api-int up
[09:17:42] pfischer: tyvm <3
[09:20:34] * pfischer thinks about auto-rebalancing rate limits across parallel flink operators and concludes: that’s not trivial
[09:23:52] would be great if envoy or the like could handle that transparently for us, but I don't think we enable anything like that, so the easy way to go is tuning the fetcher capacity
[09:27:05] Yeah, we don't really have that capability in how we use envoy right now
[09:27:38] It may be enabled by some of the work being done on a control plane for the service mesh, but it is quite some way down the road (as in, for now, a pipe dream)
[09:27:55] Hm, even then, envoy would respond with 429 (too many requests) and we would then retry, which congests our fetch queue. So we might as well reduce the queue size in the first place.
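The fetcher tuning discussed above amounts to giving the whole pipeline a request budget (~1k rps to mw-api-int) and splitting it statically across the parallel fetcher subtasks. Below is a minimal sketch of that idea, assuming a Guava RateLimiter inside a Flink rich function; the class name and the fetch call are hypothetical and not the actual SUP fetcher (which is an async operator with a configurable capacity), and the open(Configuration) signature matches pre-1.19 Flink.

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

/**
 * Illustrative sketch only: a statically partitioned rate limit, where each
 * parallel subtask takes an equal share of a global requests-per-second budget.
 */
public class RateLimitedFetch extends RichMapFunction<String, String> {
    private final double globalRequestsPerSecond;
    private transient RateLimiter limiter;

    public RateLimitedFetch(double globalRequestsPerSecond) {
        this.globalRequestsPerSecond = globalRequestsPerSecond;
    }

    @Override
    public void open(Configuration parameters) {
        // Split the global budget evenly across subtasks. This does not
        // rebalance if one subtask is idle, which is the "not trivial" part.
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
        limiter = RateLimiter.create(globalRequestsPerSecond / parallelism);
    }

    @Override
    public String map(String pageUpdate) {
        limiter.acquire(); // blocks until a permit is available
        return fetchFromMwApiInt(pageUpdate);
    }

    private String fetchFromMwApiInt(String pageUpdate) {
        // Placeholder for the actual mw-api-int request (hypothetical).
        return pageUpdate;
    }
}
```

The static split is what makes auto-rebalancing non-trivial: an idle subtask's unused share cannot be borrowed by a busy one, so the effective global rate can sit below the agreed budget.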
[09:30:06] true, some tuning will have to be done anyway indeed
[09:40:54] dcausse: BTW, here’s the 503 fix: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/115
[09:46:31] looking
[10:04:24] Thanks! I’ll add a restart strategy, one sec.
[10:05:06] pfischer: if I'm not mistaken the restart strategy is configured from helm files?
[10:06:12] Both are possible: you can set it per application or let the application use a fallback strategy, so it uses whatever the cluster defines
[10:07:06] But maybe it’s better to keep it in the helm file
[10:08:55] I'm fine either way, but it seems it's already set up in values-main.yaml
[11:01:23] lunch
[13:09:48] o/
[14:56:08] pfischer flaky internet again
[15:03:09] in p+t meeting, will join the wed meeting later
[15:03:28] ^^ same
[15:04:18] will ask about reproducing a 503 in #wikimedia-k8s-sig
[15:47:40] Still getting a lot of packet loss on my cable connection... will switch to hotspot if this keeps up
[15:59:18] pfischer this is the scanning repo I was talking about
[16:06:05] https://github.com/ossf/scorecard/releases/download/v4.13.1/scorecard_4.13.1_darwin_arm64.tar.gz
[16:06:08] oops
[16:06:16] https://github.com/ossf/scorecard
[16:07:08] workout, back in ~40
[16:58:31] nsvk
[17:39:41] dinner
[17:45:04] lunch, back in ~40
[18:22:52] back
[18:31:50] bit I missed in the flink docs before: "Internally, back pressure is judged based on the availability of output buffers. If a task has no available output buffers, then that task is considered back pressured."
[18:41:09] cable's still sitting at ~10% packet loss... going to try hotspot
[21:30:12] me wonders how we are supposed to count clicks to the skin autocomplete that originate on Special:Search. Currently they are counted as both
[21:30:27] probably as not special:search :P
[22:41:28] * ebernhardson is sadly disappointed trying to copy from jupyter with 2yy
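On the restart-strategy thread from earlier in the log: the exponential delay strategy can be set either through the Flink configuration that the helm chart renders (as already set up in values-main.yaml) or directly in the job code. Below is a minimal in-code sketch using Flink's RestartStrategies API; the backoff values are illustrative assumptions, not the production settings.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exponential delay restart: the backoff grows with each consecutive
        // failure and resets after a stable period, instead of exhausting a
        // fixed budget of restarts for the whole life of the job.
        env.setRestartStrategy(RestartStrategies.exponentialDelayRestart(
                Time.seconds(1),   // initial backoff after a failure (illustrative)
                Time.minutes(5),   // cap on the backoff
                2.0,               // backoff multiplier per consecutive failure
                Time.hours(1),     // reset backoff after this long without failures
                0.1));             // jitter factor to avoid synchronized restarts
    }
}
```

Unlike FixedDelayRestartBackoffTimeStrategy with its 10-restarts-per-job-lifetime limit, the exponential delay strategy resets its backoff once the job has run cleanly for the configured threshold, which is what a long-lived streaming job hit by occasional 503s needs.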