[10:35:49] \o I have a question about https://gerrit.wikimedia.org/r/c/operations/alerts/+/984219 (adding OOM kill and Memlimit alerts to k8s containers)
[10:36:28] Am I understanding it correctly that the deploy-tag in team-sre/kubernetes-prod.yaml indicates that these alerts would already fire for ML/LiftWing containers, should the conditions be right?
[10:46:11] It doesn't seem like that tag works the way I suspected it to, since our prom instance (config dir prometheus2005:/srv/prometheus/k8s-mlserve) has no mention of "oom" (case insensitive)
[10:50:19] Ah, nvm, they live in /srv/alerts
[10:54:26] So the question is: we'd like that alert to be crit, not warn. What is the general feeling about doing that? Otherwise, we (ML) could keep a copy, but that's asking for drift/duplication.
[11:12:55] You can do a yaml reference duplication and overload the expression to contain externalLabels.prometheus=k8s-mlserve, I think
[11:13:01] You may want to ask o11y
[11:59:59] why though?
[12:00:18] why would you want this alert to become critical, that is
[12:27:20] because warnings get lost?
[12:27:45] AIUI, warnings don't trigger IRC notifications, for example
[12:32:11] I'm not dead set on making it a crit alert for everyone, it's just an idea
[14:48:22] critical, though, should ideally mean taking action pretty quickly to resolve the alert (again, ideally). If that is not the intent (and it was not when we created that alert), then critical isn't a good target for this. E.g. what action would you take if that alert shows up as critical? There isn't much you can do aside from bumping memory limits or restarting the pod. And if you have a suspected memory leak, neither is sufficient. You end up having the alert (maybe ACKed) for a long time, which causes alert fatigue, training everyone to ignore the alert (and arguably alerts in general, which admittedly has already happened here).
[14:48:53] Ideally, a warning should call some trigger that files a task and then resolves the warning in a way that is permanent as long as the task is open.
[14:50:36] but while we aren't there, at least the "warning" part doesn't suggest that quick action is required.
[14:54:50] The alert that I was editing initially has a threshold of 95%, so closer to the limit, and our services have pretty flat steady states when it comes to memory use. So I guess our alert was more about catching something creeping towards OOM danger that is now close enough that action needs to be taken on the order of days, not weeks.
[14:56:47] Maybe a staggered alert set (90% warn, 95% crit) would be an option
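A sketch of the staggered warn/crit idea above as Prometheus alerting rules. This is only an illustration under assumptions, not the rule from the Gerrit change: the alert names, metric selectors and thresholds are made up, and the critical copy assumes the prometheus="k8s-mlserve" external label is visible at rule-evaluation time.

  groups:
    - name: container_memory_sketch
      rules:
        # Warning: memory usage creeping towards the container limit (90%).
        - alert: KubernetesContainerMemoryCloseToLimit
          expr: |
            container_memory_working_set_bytes{container!=""}
              / on (namespace, pod, container)
            kube_pod_container_resource_limits{resource="memory"} > 0.90
          for: 30m
          labels:
            severity: warning
        # A stricter copy scoped to a single Prometheus instance: same shape,
        # higher threshold, critical severity. The prometheus="k8s-mlserve"
        # matcher only works if that external label is visible where the rule
        # is evaluated (e.g. via Thanos); otherwise the scoping has to happen
        # through the deploy-tag instead. A YAML anchor/alias could share the
        # common fields between the two rules if copy/paste drift is a concern.
        - alert: KubernetesContainerMemoryCloseToLimitMlserve
          expr: |
            container_memory_working_set_bytes{container!="", prometheus="k8s-mlserve"}
              / on (namespace, pod, container)
            kube_pod_container_resource_limits{resource="memory"} > 0.95
          for: 30m
          labels:
            severity: critical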
[15:00:42] I suppose it depends a lot on what the workloads are. In the WikiKube case we don't fuss much if OOM is going to show up, because a) we have enough replicas not to care much about a specific instance of a service getting OOMed, and b) almost all services have a very fast start-up time, so something getting OOMed and restarted isn't too much of a nuisance as far as the impact to the service goes
[15:01:29] The original intent behind those alerts was to actually catch system-y components like calico-node, calico-typha, eventrouter et al
[15:01:44] being very close to their limits and being routinely restarted
[15:02:36] we fixed some of those, and then we added the rest of the services to get an idea of what is going on. We do have 2 things that stand out and are in the limbo I mentioned above
[15:02:56] linkrecommendation and eventstreams are routinely close to their limits and bumping them didn't help
[15:03:23] but not as a service ofc, just a specific instance (or 2)
[15:04:01] But to get back to your point: if your expected reaction time is days, not minutes/hours, I'd argue it's a warning.
[15:04:48] put differently, if you can go to sleep, wake up, find a task that says "container X is close to the memory limit" and fix it at your own leisure, it's not a critical alert.
[16:43:33] Mh, I see your points. I will have a think about it.
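For context on the "warnings don't trigger IRC notifications" / "file a task" part of the discussion: that split is typically just Alertmanager routing on the severity label. A minimal, made-up sketch of the idea; the receiver names and webhook URLs are placeholders, not the actual production configuration.

  route:
    receiver: default
    routes:
      # Anything critical goes to the interrupt-y channel (IRC relay / paging).
      - matchers:
          - 'severity="critical"'
        receiver: irc
      # Warnings get turned into tasks and handled at leisure instead.
      - matchers:
          - 'severity="warning"'
        receiver: task-queue

  receivers:
    - name: default
    - name: irc
      webhook_configs:
        - url: https://alerts.example.org/irc-relay   # placeholder
    - name: task-queue
      webhook_configs:
        - url: https://alerts.example.org/file-task   # placeholder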