[08:04:24] <_joe_> btullis: is it possible that the high rate of memcached errors from the dumps jobs is related to your change
[08:07:27] <_joe_> the timing seems suspiciously similar https://logstash.wikimedia.org/goto/2a745ba1802ed97e53c76349982b519a
[08:07:44] _joe_: I doubt that it's related to yesterday's change. That was only Flink-related. I can't immediately think why it would have started.
[08:08:34] <_joe_> btullis: in any case, it seems to be a consistent problem for dumps since yesterday at 13:00Z
[08:09:29] Yep, I will check it out with high priority.
[08:11:00] <_joe_> and yes I also don't see how your change could cause this, tbh
[08:14:16] but it's definitely in our wheelhouse, as the saying goes.
[09:04:23] I still have no idea why they suddenly started, but we now have lots of failed requests to memcached on 127.0.0.1:11213 - e.g. https://logstash.wikimedia.org/goto/d6588eb85ff96a5e9724b530a14ba9b0
[09:05:08] We don't run an mcrouter container in our mediawiki pods and we never have. But perhaps we should.
[09:10:38] https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/dse-k8s-services/mediawiki-dumps-legacy/values.yaml#L199
[10:22:31] <_joe_> or run them as daemonsets, yes
[10:31:01] Thanks. Can we silence or filter out these alerts while we work on this? I'm still unclear on why they suddenly started, seeing as we have been running these dumps for ~3 months now. They have been working fine without a connection to memcached, so I would like to understand the benefit of adding mcrouter to the setup, other than reducing the noisy errors.
[14:16:10] thanks for looking into this, btullis. Indeed, if these have been running all along, then it's puzzling why we're only seeing the errors now, so suddenly.
[14:16:10] I believe the only way to suppress the errors in the alert would be to update the expression to exclude dumps temporarily. Is there a task I can reference in a TODO adjacent to where I would do that? (In which case, I can do so.)
[14:16:10] although, oddly, it looks like this stopped around 9:20 UTC today?
[14:21:21] swfrench-wmf: hmm.. might it have to do with per-wiki configs?
[14:24:22] could very well be, yeah - I would say that my knowledge of "how dumps works, including what wikis are processed when and on what schedule" is near zero :)
[14:42:33] It seemed to be affecting only wikidatawiki and commonswiki, but we also got some similar alerts from the wikibase dumps.
[14:45:20] swfrench-wmf: If you have a patch, you could attach it to this task: T352650 - or if you wanted to make a new task for handling it, then you could add that as a parent.
[14:45:21] T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes - https://phabricator.wikimedia.org/T352650
[14:47:19] There is some background information on how dumps works now, here: https://wikitech.wikimedia.org/wiki/Dumps/Airflow
[14:47:50] I could also give an overview in an SRE staff meeting some time, if that would help.
[14:52:42] thanks, btullis!
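
On the mcrouter idea raised at 09:05 and 10:22 (sidecar container vs. daemonset): below is a minimal sketch of what enabling an mcrouter sidecar might look like in the service's helmfile values. This assumes the chart exposes a cache.mcrouter block; the key names, port, and backend shown are illustrative assumptions, not the chart's actual schema, and would need to be checked against operations/deployment-charts before use.

cache:
  mcrouter:
    # Hypothetical values: key names are assumptions, not the chart's real schema.
    enabled: true
    # mcrouter would listen on localhost inside the pod, so MediaWiki can keep
    # using 127.0.0.1:11213 (the address the failing requests were aimed at).
    port: 11213
    pools:
      main:
        # Placeholder backends; real values would point at the production
        # memcached pool for the target datacentre.
        servers:
          - memcached-backend.example.wmnet:11211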
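
The 14:16 messages suggest temporarily updating the alert expression to exclude dumps, with a TODO pointing at a task. A minimal sketch of how such a rule could look, assuming a Prometheus alerting rule; the alert name, metric name (mediawiki_memcached_error_total), deployment label, and threshold are all hypothetical and not taken from the real rule in use.

groups:
  - name: mediawiki-memcached
    rules:
      - alert: MediaWikiMemcachedHighErrorRate
        # TODO: T352650 - drop the mediawiki-dumps-legacy exclusion once the
        # dumps pods have a working mcrouter (sidecar or daemonset) setup.
        expr: |
          sum by (deployment) (
            rate(mediawiki_memcached_error_total{deployment!="mediawiki-dumps-legacy"}[5m])
          ) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High rate of memcached errors from {{ $labels.deployment }}"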