[12:38:11] akosiaris: I partially agree with your point on the critical, since seeing OOMs may impact our SLO metrics (when we'll have one and when we'll pay attention to burning error budget etc..). Maybe a single container OOMing or getting very close to mem limits is not concerning, but if 20%+ of the containers in a deployment start to do it I'd argue it is a critical..
[12:38:56] just to understand - if now a container on wikikube or mlserve OOMs or crosses some memory limit, do we see it popping up in alerts.w.o?
[12:39:06] as warning I mean
[12:39:50] if so it may be ok for our use case, we liked more to see IRC notifications but it may quickly get spammy
[12:43:47] elukey: we have a warning if a container is using more than 95% of its memory limit for 10 minutes, and a warning if a container gets OOM-killed at least twice in the last 10 minutes
[12:44:02] KubernetesContainerReachingMemoryLimit and KubernetesContainerOomKilled
[12:45:41] claime: yep yep I know, I saw the alerts config but never saw any occurrence of those popping up in alerts.w.o, so I was wondering if they were silenced or fully working
[12:45:56] just to get an idea about the current state
[12:46:01] Yes, warnings show up on the AM web UI
[12:46:24] elukey: wdym? There are currently 2 of them for linkrecommendation and one for k8s-controller-sidecar
[12:46:31] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=severity%3Dwarning
[12:49:59] claime: ah lovely, I missed them, checked yesterday
[12:50:19] no worries <3
[12:50:28] I was afraid something was wrong x)
[12:50:34] nono my head is wrong
[12:50:57] elukey: depends on the workload again. The thing is, it is difficult to predict whether such behaviors will actually lead to an outage or an incident. In many cases the workload gets throttled and somewhat starved of memory and asked to heavily reclaim stuff, but it won't die nor return an error (nor get killed). We've seen that happen. The actual thing you might notice is that latency for some requests increases in those cases. Which is arguably the thing you probably want to monitor and alert on.
[12:51:49] ideally you don't want to issue critical alerts for "saturation" metrics. You want to issue them for things that cause problems to your users. Like errors, increased latency, SLO violations (sounding like a broken record, I know), etc.
[12:52:21] SLOs being supposedly how you define what your users find acceptable ofc
[12:53:14] okok point taken
[12:53:17] but the transition from "oh look, X containers consume a lot of memory" to "and here we are, in an incident" isn't always there, and even if it is, it might be a long time later.
[12:53:47] SLO metrics are probably a better answer, but I am a little biased since they are not that widespread yet
[12:53:49] I am not in love with those alerts in a warning state either, tbh
[12:54:43] I'd much rather we just had a system that files a task, ACKs the warning and links to interesting data regarding this. But I suppose, midsummer night's dreams.
[12:55:00] we do have that btw for broken disks, so not totally dreams
[12:55:48] I think the danger with filing tickets is that those too lead to fatigue _unless_ they are deduped/joined automagically
[12:56:16] yes yes, true as well
[12:56:33] klausman: wdyt about removing our alerts and leaving only the warnings?
[12:56:44] Sounds good to me.
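(The two warnings claime describes above map onto Prometheus alerting rules of roughly the following shape. This is a minimal sketch, assuming the standard cadvisor metrics container_memory_working_set_bytes, container_spec_memory_limit_bytes and container_oom_events_total are available; only the alert names and thresholds come from the conversation, the exact expressions, labels and annotations in the production alert definitions may differ.)

```yaml
groups:
  - name: kubernetes-container-memory
    rules:
      # Warn when a container has been above 95% of its memory limit for 10 minutes.
      # Limits of 0 mean "no limit", so they are filtered out of the denominator.
      - alert: KubernetesContainerReachingMemoryLimit
        expr: |
          container_memory_working_set_bytes
            / (container_spec_memory_limit_bytes > 0) > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} is above 95% of its memory limit"

      # Warn when a container has been OOM-killed at least twice in the last 10 minutes.
      # Using the cadvisor OOM event counter here is an assumption; other setups derive
      # the same signal from kube-state-metrics restart/termination-reason metrics.
      - alert: KubernetesContainerOomKilled
        expr: increase(container_oom_events_total[10m]) >= 2
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} was OOM-killed at least twice in 10 minutes"
```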
[12:56:50] we'll (ML) need to check the alerts.w.o page, but it seems ok
[12:57:11] I'll also add a README in our subdir that points out that other subdirs may have relevant alerts for us (the deploy: bit is easy to miss, in my experience)
[13:00:53] as a hint, make sure you find the filters that work well for your team in alerts.w.o and bookmark the result
[13:01:02] if you try to make sense of the initial page, it's pointless
[13:01:15] I am using https://alerts.wikimedia.org/?q=%40state%3Dactive&q=instance%3D~%28%5Eml%7C%5Eores%29 which works really well
[13:01:39] yeah, something like that
[13:01:39] state=active and instance=~(^ml|^ores)
[13:02:00] the latter could be modified to avoid ores :D
[13:02:07] So far I have not seen anything that wasn't relevant, and I think/hope that I didn't miss anything, either
[13:02:31] Weeell, it's a pinned tab/bookmark and I created it when we still had to care :)
[13:04:09] I am biased, I know, it was too painful
[13:04:37] I've updated the alerting patch accordingly
[13:04:56] I have a couple, one limited to team=sre, and one limited to team=serviceops
[13:06:25] I also started work on something to run in a terminal for those that don't want to "waste" memory on a whole tab, but it's stalled atm.
[13:08:49] completely unrelated - I'd need to build some docker images from production-images, is it still ok to run build-production-images manually? I know about the timer etc. related to the weekly build, but I don't see anything else
[13:09:24] elukey: yep
[13:09:51] <3
[13:28:45] a lot of failed builds, sigh
[13:32:21] ah lol https://phabricator.wikimedia.org/T350366 is still open
[13:35:44] I'd be inclined to close this and open new tasks for every failed image, tagging the owners etc..
[13:35:57] _joe_ --^ ok if I do it?
[13:50:02] <_joe_> elukey: sure, but I'm starting to be tempted to just root them out
[13:55:33] <_joe_> elukey: what images are failing?
[13:56:49] openjdk, spark and flink afaics
[13:56:53] I am opening tasks
[14:11:29] I see also a failure for docker-registry.discovery.wmnet/php7.4-fpm-multiversion-base
[14:21:41] <_joe_> elukey: uh, that is new
[14:21:56] <_joe_> elukey: openjdk would make spark and flink fail too, btw
[14:22:00] <_joe_> probably
[14:22:14] <_joe_> anyways, I already intended to boot the spark images off to their own space
[14:22:27] <_joe_> as they're too big and take too long to build
[14:22:41] <_joe_> and we need three different versions of them
[14:23:10] there are different issues, I think I found the flink one as well
[15:54:12] Happy to make/work a subtask for flink operator if it would make life easier
[15:58:34] I've been investigating the spark builds and they seem to be caused by a bug in the openjdk-8-jre-headless package. https://phabricator.wikimedia.org/T358866
[15:59:49] I was wondering about reporting this upstream to Debian.
[16:14:34] dcausse: thanks for the info, I'll get a ticket started with that... btullis also found a different issue w/ the JDK in T358866
[17:18:26] Created T358879 for the flink operator image build
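(Relating to the alerts.w.o filter discussion above and the "something to run in a terminal" idea: the same state=active / instance=~(^ml|^ores) query can also be run from the command line. This is only a sketch: alertmanager.example.wmnet:9093 is a placeholder for the real Alertmanager API endpoint, since alerts.wikimedia.org itself serves the dashboard frontend.)

```sh
# List active alerts whose instance label matches the ML/ORES hosts, via amtool.
amtool alert query --alertmanager.url=http://alertmanager.example.wmnet:9093 \
  'instance=~"^(ml|ores)"'

# Roughly equivalent query against the Alertmanager v2 HTTP API.
curl -sG 'http://alertmanager.example.wmnet:9093/api/v2/alerts?active=true' \
  --data-urlencode 'filter=instance=~"^(ml|ores)"' \
  | jq -r '.[].labels.alertname'
```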