[14:53:43] Hello, we're getting quite a few alerts from the SystemdUnitFailed alert to data-engineering about people's jupyter notebook servers on stats boxes.
[14:53:50] e.g. (SystemdUnitFailed) firing: (8) jupyter-aitolkyn-singleuser-conda-analytics.service Failed on stat1005:9100
[14:54:58] I can create a ticket and take a look at excluding these myself from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/systemd.yaml but I thought I'd mention it here as well.
[14:56:23] hey btullis, do you know what these are about? i.e. are the failures an actual signal of something that's wrong and needs intervention?
[14:58:22] I'm not familiar with how these units come about (also poking on stat1005 out of curiosity)
[15:00:40] Yes, it's related to the `jupyterhub-conda.service` that runs on each of the stats boxes. This is like a parent service and we *would* want to know if this service went down.
[15:02:19] Users connect with an SSH tunnel and then they can each start a `jupyter-$username-singleuser.service` - but only one per host. It's a transient systemd service, so there is no corresponding unit file in `/lib/systemd/system`
[15:04:30] got it, and once the unit fails is there something to be investigated? as in, is it useful at all that the units stay in `systemctl list-units --failed`?
[15:04:31] btullis: is the event over or ongoing? often the systemd state is only broken because things failed once in the past, and `systemctl reset-failed` will clear it
[15:05:38] also happy to take this on a task, trying to understand the problem better
[15:06:02] alerts like this on stat* hosts have been going on for a long time, but the service names change
[15:06:26] often because one user uses all the resources
[15:06:39] Thanks, I'm more than happy to help share knowledge and reduce the unnecessary noise.
[15:07:56] I don't think ignoring them is the best solution here. Maybe something more like limiting how many resources users can use on these shared hosts. my 2 cents
[15:08:26] btullis: SGTM! thank you
[15:08:30] In general, I would say that it won't need investigating. It will generally only be of interest to the user themselves.
[15:09:12] I've just checked and there is a *clean* way to stop one's own jupyterhub-singleuser service, and this does indeed remove the transient unit altogether.
[15:09:25] since it's on-topic:
[15:09:25] 15:08 <+icinga-wm> PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service
[15:09:33] rsync in this case
[15:09:39] might be that random things are being killed by the OOM killer?
[15:09:49] that would match what I have seen on stat* hosts in the past sometimes
[15:10:05] A failed jupyter-$username-singleuser unit might be killed by an OOM, yes.
[15:11:20] Ok, that rsync-published alert probably *should* continue to be sent to the data-engineering team for investigation. There was some work done recently about publishing from HDFS, and that generated some transient alerts like that while the work was being done.
[15:12:15] oh, are you getting notifications from icinga sent to you?
[15:13:26] Yes, plenty :-) Some are desirable for the data engineers on the team, some are only really relevant for the SREs on the team (like me).
[15:14:28] interesting, I don't
[15:14:29] It's going to get even trickier, because I'm getting reorganised into a 'data platforms SRE' team, so moved outside of data engineering, but that's another story.
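For reference on the earlier point about failed transient units: a minimal shell sketch of how such a unit might be inspected and cleared on a stat host, assuming shell access and using the unit name from the example alert above (the real name varies per user and host):

```shell
# List all units currently in the failed state on this host
systemctl list-units --failed

# Inspect the transient unit named in the alert
# (example unit name taken from the alert quoted above)
systemctl status jupyter-aitolkyn-singleuser-conda-analytics.service
journalctl -u jupyter-aitolkyn-singleuser-conda-analytics.service -n 50

# Clear the failed state; for a transient unit this should also let systemd
# garbage-collect it, so it disappears from the failed list entirely
sudo systemctl reset-failed jupyter-aitolkyn-singleuser-conda-analytics.service
```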
[15:14:45] I always thought it stopped for everyone
[15:15:02] are you sure that's icinga and not an alertmanager custom check?
[15:15:13] interesting use case yeah, btullis would you mind following up on a task with #observability-alerting? I'm definitely on board with reducing alert noise
[15:15:35] mutante: Oh, hang on then, let me check the last time I received one.
[15:15:45] godog: Yes, I certainly will.
[15:16:08] SystemdUnitFailed alerts are alertmanager-based, and generic in nature
[15:16:26] btullis: cheers!
[15:19:59] OK, the kind of rsync-published emails that I got were stderr failures from the systemd timers themselves, so you're right. Not Icinga. Examples here: https://lists.wikimedia.org/hyperkitty/search?mlist=data-engineering-alerts%40lists.wikimedia.org&q=rsync
[15:24:14] indeed, the alert name gave it away for me; we're using camel case for alertmanager alert names
[15:24:43] "This error could also be triggered by the wrong permission set during a manual operation." "my fault, I set permissions improperly on the manual output I copied"
[15:25:08] well, with manual things done by users it's probably always going to have some level of noise
[15:25:22] that's kind of the thing with the stat* hosts
[15:26:01] in most other cases we don't have all this manual stuff going on
[15:27:53] btullis: I have to go shortly and I'll follow up on phab (will read backscroll here too)
[16:23:23] Create the following ticket: https://phabricator.wikimedia.org/T336951
[16:23:34] *Created
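As a rough sketch of how the rsync-published failure on stat1009 might be triaged, assuming shell access to the host; the timer glob and the OOM check are assumptions based on the discussion above, not confirmed details:

```shell
# Check the current state and recent output of the failed unit from the icinga alert
systemctl status rsync-published.service
journalctl -u rsync-published.service --since "1 hour ago"

# If the service is driven by a timer (assumed here), see when it last ran
# and when it is scheduled to fire next
systemctl list-timers 'rsync-published*'

# Check whether the kernel OOM killer has been active recently, since that was
# one suspected cause of units dying on stat* hosts
journalctl -k --since "1 day ago" | grep -i "out of memory"
```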