[08:28:29] gehel: I am sorry, I forgot to reject the JVM meeting invite. Thanks for moving the SDK Man ticket forward!
[09:05:01] pfischer: 1:1 ?
[09:06:38] Sorry, 1sec
[10:04:27] lunch
[11:40:35] lunch
[12:11:53] SUP consumer jobs seem down since 3am
[12:12:17] dcausse: looking
[12:12:22] java.lang.IllegalArgumentException: The request entry sent to the buffer was of size [10205204], when the maxRecordSizeInBytes was set to [9437264].
[12:12:55] weird, that should be filtered
[12:13:15] https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.25?id=uKFnMJABAJJzGk1BfJMO
[12:13:34] I could relaunch the pipeline with an increased limit
[12:13:45] just to let it pass
[12:13:55] can investigate later
[12:13:58] sure
[12:23:13] restarted it with 5 instances to catch up quickly, rate limiting will be configured, too
[12:30:44] dcausse: If you have a moment, we use this unexpected backfill to 429-retry at client level: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/140
[12:31:08] looking
[12:32:13] fetch error rate is rising 133 msg/s
[12:32:58] shall I stop it for now?
[12:33:23] unsure if it's because ratelimiting tho, looking at the topic
[12:34:14] I’ll check the logs
[12:36:29] kafkacat -b kafka-main1005.eqiad.wmnet:9092 -t codfw.cirrussearch.update_pipeline.fetch_error.rc0 -o end | grep "429 Too Many Requests" | pv -l > /dev/null
[12:36:44] shows ~130 evt/s
[12:37:03] pfischer: perhaps redeploy with normal parallelism?
[12:37:23] reviewing your patch
[12:37:51] Sure, I’ll restart…
[12:46:35] pfischer: I see 10 retries by default, should we always retry? and possibly let something else fail if the pipeline comes to a stall?
[12:51:15] dcausse: Yes, I was not sure if we should cap it, too. I’m fine with retrying forever
[12:51:54] This is where the overall async operator timeout might be handy.
[12:56:32] yes was about to comment on this
[12:56:47] I think we should perhaps make it higher
[12:57:10] there's no real way to know the theoretical max wait time now
[12:57:36] so perhaps we should hardcode (via config) to something relatively high (5min?)
[13:00:15] Sure, I’ll update my PR. Thank you for looking into it!
[13:00:29] * pfischer needs better alerting on alert e-mails
[13:01:00] I missed this one too... just saw the alert on IRC
[13:39:02] Yes, I already made contact
[16:53:02] dinner
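
Note on the retry discussion above (12:46 to 12:57): the idea was to retry 429 responses without a retry-count cap and let one overall timeout, set high via config (5min was floated), be the thing that eventually fails. A minimal Python sketch of that pattern follows. It is not the code from the merge request and does not use the Flink async operator API. The function name, the exception, and the backoff values are hypothetical; only the 300 second default comes from the log.

```python
import time
from typing import Callable, Tuple


class RetryDeadlineExceeded(Exception):
    """Raised when the overall retry budget is used up (hypothetical name)."""


def fetch_with_429_retry(
    fetch: Callable[[], Tuple[int, str]],
    overall_timeout_s: float = 300.0,  # the "5min?" suggested in the log
    base_delay_s: float = 0.5,         # assumed backoff start, not from the log
    max_delay_s: float = 30.0,         # assumed backoff ceiling, not from the log
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns something other than HTTP 429.

    There is no cap on the number of attempts. The only limit is the
    overall deadline, after which RetryDeadlineExceeded is raised.
    """
    deadline = clock() + overall_timeout_s
    delay = base_delay_s
    attempts = 0
    while True:
        attempts += 1
        status, body = fetch()
        if status != 429:
            # Success or a non-rate-limit error: hand it back to the caller.
            return status, body
        remaining = deadline - clock()
        if remaining <= 0:
            raise RetryDeadlineExceeded(
                f"still rate limited after {attempts} attempts"
            )
        # Never sleep past the deadline, then double the delay up to the ceiling.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

The clock and sleep parameters are injectable so the behaviour can be checked without waiting. With a single overall deadline, the theoretical maximum wait is known again (it is the configured timeout), which was the concern raised at 12:57:10.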