[14:53:43] Hello, we're getting quite a few alerts from the SystemdUnitFailed alert to data-engineering about people's jupyter notebook servers on stats boxes.
[14:53:50] e.g. (SystemdUnitFailed) firing: (8) jupyter-aitolkyn-singleuser-conda-analytics.service Failed on stat1005:9100
[14:54:58] I can create a ticket and take a look at excluding these myself from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/systemd.yaml but I thought I'd mention it here as well.
[14:56:23] hey btullis, do you know what these are about? i.e. are the failures an actual signal of something that's wrong and needs intervention?
[14:58:22] I'm not familiar with how these units come about (also poking on stat1005 out of curiosity)
[15:00:40] Yes, it's related to the `jupyterhub-conda.service` that runs on each of the stats boxes. This is like a parent service and we *would* want to know if this service went down.
[15:02:19] Users connect with an SSH tunnel and then they can each start a `jupyter-$username-singleuser.service` - but only one per host. It's a transient systemd service, so there is no corresponding unit file in `/lib/systemd/system`
[15:04:30] got it, and once the unit fails is there something to be investigated? as in, is it useful at all that the units stay in `systemctl list-units --failed`?
[15:04:31] btullis: is the event over or ongoing? often the systemd state is only broken because things failed once in the past, and `systemctl reset-failed` will clear it
[15:05:38] also happy to take this on a task, trying to understand the problem better
[15:06:02] alerts like this on stat* hosts have been going on for a long time, but the service names change
[15:06:26] often because one user uses all the resources
[15:06:39] Thanks, I'm more than happy to help share knowledge and reduce the unnecessary noise.
[15:07:56] I don't think ignoring them is the best solution here. Maybe something more like limiting how many resources users can use on these shared hosts. my 2 cents
[15:08:26] btullis: SGTM! thank you
[15:08:30] In general, I would say that it won't need investigating. It will generally only be of interest to the user themselves.
[15:09:12] I've just checked and there is a *clean* way to stop one's own jupyterhub-singleuser service, and this does indeed remove the transient unit altogether.
[15:09:25] since it's on-topic:
[15:09:25] 15:08 <+icinga-wm> PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service
[15:09:33] rsync in this case
[15:09:39] might be that random things are being killed by the OOM killer?
[15:09:49] that would match what I have seen on stat* hosts in the past sometimes
[15:10:05] A failed jupyter-$username-singleuser unit might be killed by an OOM, yes.
[15:11:20] Ok, that rsync-published alert probably *should* continue to be sent to the data-engineering team for investigation. There was some work done recently about publishing from HDFS, and that generated some transient alerts like that while the work was being done.
[15:12:15] oh, are you getting notifications from icinga sent to you?
[15:13:26] Yes, plenty :-) Some are desirable for the data engineers on the team, some are only really relevant for the SREs on the team (like me).
[15:14:28] interesting, I don't
[15:14:29] It's going to get even trickier, because I'm getting reorganised into a 'data platforms SRE' team, so moved outside of data engineering, but that's another story.
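For reference on the earlier point about failed transient units: a minimal shell sketch of how such a unit might be inspected and cleared on a stat host, assuming shell access and using the unit name from the example alert above (the real name varies per user and host):

```shell
# List all units currently in the failed state on this host
systemctl list-units --failed

# Inspect the transient unit named in the alert
# (example unit name taken from the alert quoted above)
systemctl status jupyter-aitolkyn-singleuser-conda-analytics.service
journalctl -u jupyter-aitolkyn-singleuser-conda-analytics.service -n 50

# Clear the failed state; for a transient unit this should also let systemd
# garbage-collect it, so it disappears from the failed list entirely
sudo systemctl reset-failed jupyter-aitolkyn-singleuser-conda-analytics.service
```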
[15:14:45] I always thought it stopped for everyone
[15:15:02] are you sure that's icinga and not an alertmanager custom check?
[15:15:13] interesting use case yeah, btullis would you mind following up on a task with #observability-alerting? I'm definitely on board with reducing alert noise
[15:15:35] mutante: Oh, hang on then, let me check the last time I received one.
[15:15:45] godog: Yes, I certainly will.
[15:16:08] SystemdUnitFailed alerts are alertmanager-based, and generic in nature
[15:16:26] btullis: cheers!
[15:19:59] OK, the kind of rsync-published emails that I got were stderr failures from the systemd timers themselves, so you're right. Not Icinga. Examples here: https://lists.wikimedia.org/hyperkitty/search?mlist=data-engineering-alerts%40lists.wikimedia.org&q=rsync
[15:24:14] indeed, the alert name gave it away for me; we're using camel case for alertmanager alert names
[15:24:43] "This error could also be triggered by the wrong permission set during a manual operation." "my fault, I set permissions improperly on the manual output I copied"
[15:25:08] well, with manual things done by users it's probably always going to have some level of noise
[15:25:22] that's kind of the thing with the stat* hosts
[15:26:01] in most other cases we don't have all this manual stuff going on
[15:27:53] btullis: I have to go shortly and I'll follow up on phab (will read backscroll here too)
[16:23:23] Create the following ticket: https://phabricator.wikimedia.org/T336951
[16:23:34] *Created
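As a rough sketch of how the rsync-published failure on stat1009 might be triaged, assuming shell access to the host; the timer glob and the OOM check are assumptions based on the discussion above, not confirmed details:

```shell
# Check the current state and recent output of the failed unit from the icinga alert
systemctl status rsync-published.service
journalctl -u rsync-published.service --since "1 hour ago"

# If the service is driven by a timer (assumed here), see when it last ran
# and when it is scheduled to fire next
systemctl list-timers 'rsync-published*'

# Check whether the kernel OOM killer has been active recently, since that was
# one suspected cause of units dying on stat* hosts
journalctl -k --since "1 day ago" | grep -i "out of memory"
```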