[02:24:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:04:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:28:27] (PS1) Btullis: Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514)
[08:30:03] (PS2) Btullis: Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514)
[08:49:40] (CR) CI reject: [V: -1] Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:52:38] a-team: I am planning to do a rolling restart of the Hadoop workers today, to pick up a new Java version. I'm covering Ops Week, so I will be on the lookout in case any jobs need to be re-run.
[08:56:30] ack btullis - thanks for the heads up
[08:57:11] !log executing `cookbook sre.hadoop.roll-restart-workers analytics`
[08:57:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:29:13] The cookbook failed because of the hosts that are currently excluded from yarn due to being decommissioned.
This caused too high a failure rate when roll-restarting the yarn nodemanager service.
[09:29:28] !log restarting the yarn restart with `sudo cumin -b 5 -p 80 -s 30 A:hadoop-worker 'systemctl restart hadoop-yarn-nodemanager'`
[09:29:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:43:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:55:47] btullis: I'm rerunning the failed refine job (so that you know :)
[09:56:44] joal: Oh thank you! I wasn't keeping as close an eye as I should have been. Still doing too many things at once :-)
[09:56:51] np :)
[10:00:12] !log roll-restarting journal nodes with 30 seconds between each one: `sudo cumin -b 1 -p 100 -s 30 A:hadoop-hdfs-journal 'systemctl restart hadoop-hdfs-journalnode'`
[10:00:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:03:25] Data-Platform-SRE, Discovery-Search: Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (JMeybohm)
[10:05:59] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:13] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[10:06:25] Uh oh, this looks bad.
[10:06:46] (SystemdUnitFailed) firing: hadoop-hdfs-namenode.service Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:07:19] https://www.irccloud.com/pastebin/gSSbcgmB/
[10:07:33] Namenode service has failed over to an-master1002, which is good.
[10:07:37] btullis: could it be the same failover issue we saw a while back, when we're trying to fail over when too much is done on the system?
[10:08:24] I wasn't trying to do a failover. I was doing a rolling-restart of the journalnodes. The same thing the cookbook does, but with a cumin command.
[10:09:51] https://www.irccloud.com/pastebin/uLp7HFZd/
[10:09:59] Definitely related to what I was doing.
[10:10:15] ack!
[10:10:32] !log btullis@an-master1001:~$ sudo systemctl start hadoop-hdfs-namenode
[10:10:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:10:37] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:49] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[10:12:35] The namenode appears to have started again successfully, so that's good. It looks like I should have given each journalnode more time to settle before roll-restarting the next.
[10:13:07] I think I gave them the same amount of time that they get by default from the cookbook, but I'll double-check.
[10:16:46] (SystemdUnitFailed) resolved: hadoop-hdfs-namenode.service Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:56] Ah, no it was totally my fault. I mis-read the cookbook. The cookbook gives 120 seconds for each journalnode to settle before restarting the next. I took the default 30 second parameter from the yarn resourcemanager restarts.
[10:31:59] !log beginning hdfs datanode rolling restart with `sudo cumin -b 2 -p 80 -s 120 A:hadoop-worker 'systemctl restart hadoop-hdfs-datanode'`
[10:32:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:33:22] Actually, I could have used the default `-p 100` there for a threshold of 100% success. I had to use 80% on yarn because of the decommissioning nodes.
[11:01:35] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (BTullis) Unfortunately, we have to update again because of a new vulnerability announced in the hive-connector. Mentioned in: T336244 We can either stick at 2.6.1 and just regenerate the co...
[11:22:30] (CR) Btullis: "recheck" [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:32:52] 60% of the way through the Hadoop datanode restart - 53 of 87 hosts restarted.
[11:52:50] (CR) Btullis: [C: +2] Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:15:53] Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene) I succeeded in creating a jupyterhub single user session.
The error was due to a missing directory namely the `/srv/spark-tmp` directory which is part of the [[ https://github.com/wikimedia/operations-pup...
[12:16:00] (Merged) jenkins-bot: Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:26:47] The hadoop workers rolling restart has finished.
[12:38:47] Data-Engineering: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) @AndrewTavis_WMDE: Yes, going with only `wmde` sounds good to me. Let's make it so!
[12:44:09] Data-Platform-SRE, Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/450 Update airflow to version 2.6.2
[12:51:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:18] ^ looking
[13:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:46] (SystemdUnitFailed) firing:
produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:07:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:13] btullis: Heya - would you have a minute for me? I'd like to talk about the Spark history-server with you
[13:13:50] joal. Yes, hang on two ticks... Just pushing something for tomorrow.
[13:13:57] sure - let me know
[13:16:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:57] joal: OK, all done. To the cave?
[13:18:02] OMW!
[13:21:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:05] Data-Platform-SRE, Patch-For-Review: Deploy spark history - https://phabricator.wikimedia.org/T330176 (JAllemandou)
[13:35:25] Puppet patches look good btullis :) All set for tomorrow :)
[13:37:07] Data-Platform-SRE, Patch-For-Review: Deploy spark history - https://phabricator.wikimedia.org/T330176 (BTullis) p:Low→High I have been discussing this with @JAllemandou and I can now see how useful this feature would be. We have had the mapreduce history server available, but now that the vast ma...
[13:37:33] joal: Many thanks.
[13:39:17] Data-Platform-SRE: Set up Spark SQL Server - https://phabricator.wikimedia.org/T324017 (BTullis) p:High→Low Having discussed this with @JAllemandou, we have decided to lower the priority of this ticket but not decline it. There may still be some benefit to the Spark SQL server, but now that the prest...
[13:40:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:43:46] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) The kafka-setup container is now proceeding pretty well, with a different error now. ` btullis@deploy1002:~$ kubectl logs -f datahub-main-kafka-setup-job-5pqtb Error: Could not find or l...
[13:46:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:37] !log roll-restarting the eventgate-analytics-external worker pods in eqiad with: `helmfile -e eqiad --state-values-set roll_restart=1 sync`
[13:55:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:01:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:19] (PS1) Jennifer Ebe: T335860 Implement job to transform mediawiki revision_visibility_change Hql [analytics/refinery] - https://gerrit.wikimedia.org/r/935442
[14:27:41] It looks like the restart of the pods didn't help with the canary job.
[14:28:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:30:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:38:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:56:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed