[16:30:51] ebernhardson: dcausse: If you have a moment: I implemented client-side rate-limiting to see if that reduces 429s (presumably caused by bursts of requests): https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/141
[16:37:12] pfischer: looking
[16:44:38] pfischer: i'm not particularly familiar with the guava rate limiter, but can probably assume it's reasonable. The integration looks simple and straightforward
[16:51:34] i can probably ship that out and monitor it a bit today, i assume you're close to done
[17:03:19] back
[17:41:55] dinner
[17:59:46] pfischer or dcausse is there a ticket for the doc size/flink crash loop issue we talked about @ retro? I can make one if not. Just looking to associate a ticket with my MR for alerts
[18:08:51] oops, lost track of time... will be ~10m late to pairing, getting lunch now
[18:12:14] inflatador: no ticket afaik
[19:00:34] ebernhardson ACK, will get one started
[19:16:42] dcausse: pushed a couple of small fixes to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1048038 jfyi
[20:10:12] inflatador: I have not created one yet. We ran into this earlier this year already. The cirrus extension, which provides the SUP with the documents to be indexed, should take care of truncating documents > 8mb. But sometimes that apparently does not work. We increased the limit we are willing to accept on the SUP side with some headroom (12mb). I don't know if it's worth investigating why cirrus fails to truncate at 8mb.
[20:11:23] ebernhardson: Thanks! I have a few more minutes (have to catch up on some toddler-watch time)
[20:11:34] pfischer no worries, I created a task just for the alerting side (T368107). If we need more alerts for that specific scenario LMK
[20:11:35] T368107: DPE SRE: Increase visibility of Search Platform alerts - https://phabricator.wikimedia.org/T368107
[20:17:28] inflatador: Thank you! I think we had enough red lights flashing, I'm still fiddling with gmail to make those particular emails stand out.
[20:17:47] no worries, it's a constant struggle ;)
[20:18:35] Here is the CR for adding the same alerts to DPE SRE. I just asked observability and they confirmed: we have to duplicate alerts, we can't have 2 receivers for the same alert ;( https://gerrit.wikimedia.org/r/c/operations/alerts/+/1048074
[21:16:49] hmm, retry attempts per retried update for rerender_fetch might have declined from 2 to ~1 with the deploy of internal rate limiting, or it could be an artifact.. will have to check again later
[21:21:55] staying quite close to 1, good sign
[21:22:16] (unless something unexpected is happening elsewhere :P)
[22:00:38] ebernhardson: I only deployed cirrus-streaming-updater-cloudelastic so far, and the client-side rate-limiting does not seem to have an effect; we still see up to 50 req/s of 429s while 200s come at 200 req/s: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-app=flink-app-consumer-cloudelastic&var-datasource=thanos&var-destination=All&var-kubernetes_namespace=All&var-prometheus=k8s&var-site=eqiad&from=now-3h&to=now
[22:07:07] going hiking, back in 1.5hr
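
Editor's note: the log above discusses wiring Guava's RateLimiter into the streaming updater (MR 141) to smooth out request bursts. The sketch below is not the actual MR change, only a minimal illustration of how that library is typically integrated; the class and method names other than RateLimiter itself are hypothetical.

```java
// Minimal sketch, assuming a blocking client-side limiter in front of outbound
// requests. Guava's RateLimiter hands out permits at a fixed rate, so bursts are
// spread over time instead of being fired all at once.
import com.google.common.util.concurrent.RateLimiter;

public class RateLimitedSender {
    // Permits per second; in practice this would come from pipeline configuration.
    private final RateLimiter rateLimiter;

    public RateLimitedSender(double requestsPerSecond) {
        this.rateLimiter = RateLimiter.create(requestsPerSecond);
    }

    public void send(Runnable request) {
        // Blocks the calling thread until a permit is available.
        rateLimiter.acquire();
        request.run();
    }
}
```

One caveat worth keeping in mind when reading the 22:00 observation: a Guava RateLimiter only bounds the rate within a single instance, so parallel consumer subtasks each get their own budget and the aggregate rate can still exceed what the server side tolerates.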
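The 20:10 message describes the doc-size issue: cirrus is expected to truncate documents at 8mb, but sometimes does not, so the SUP accepts up to 12mb as headroom. The following sketch only illustrates that kind of size guard; the names, the enum, and the exact handling (drop vs. count) are assumptions, not the actual SUP or cirrus code.

```java
// Hypothetical size guard for incoming documents: anything over the headroom
// limit is rejected up front rather than being retried until the job crash-loops.
public final class DocumentSizeGuard {
    private static final long EXPECTED_MAX_BYTES = 8L * 1024 * 1024;  // cirrus should truncate here
    private static final long ACCEPTED_MAX_BYTES = 12L * 1024 * 1024; // SUP-side headroom

    public enum Verdict { OK, OVER_EXPECTED, REJECT }

    /** Classify a serialized document by size so oversized updates can be counted or dropped. */
    public static Verdict check(byte[] serializedDoc) {
        long size = serializedDoc.length;
        if (size > ACCEPTED_MAX_BYTES) {
            // Too large even with headroom: drop or side-output it instead of
            // letting the indexing request fail repeatedly.
            return Verdict.REJECT;
        }
        if (size > EXPECTED_MAX_BYTES) {
            // Larger than cirrus should have produced; accept it, but this is the
            // case worth counting/alerting on so the truncation bug stays visible.
            return Verdict.OVER_EXPECTED;
        }
        return Verdict.OK;
    }
}
```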