[02:24:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:04:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:28:27] <wikibugs>	 (03PS1) 10Btullis: Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514)
[08:30:03] <wikibugs>	 (03PS2) 10Btullis: Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514)
[08:49:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[08:52:38] <btullis>	 a-team: I am planning to do a rolling restart of the Hadoop workers today, to pick up a new Java version. I'm covering Ops Week, so I will be on the lookout in case any jobs need to be re-run.
[08:56:30] <joal>	 ack btullis - thanks for the heads up
[08:57:11] <btullis>	 !log executing `cookbook sre.hadoop.roll-restart-workers analytics`
[08:57:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:29:13] <btullis>	 The cookbook failed because of the hosts that are currently excluded from yarn due to being decommissioned. This caused to high a failure rate when roll-restarting the yarn nodemanger service.
[09:29:28] <btullis>	 !log restarting the yarn restart with `sudo cumin -b 5 -p 80 -s 30 A:hadoop-worker 'systemctl restart hadoop-yarn-nodemanager'`
[09:29:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:43:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:55:47] <joal>	 btullis: I'm rerunning the failed refine job (sothat you know :)
[09:56:44] <btullis>	 joal: Oh thank you! I wasn't keeping as close an eye as I should have been. Still doing too many things at once :-)
[09:56:51] <joal>	 np :)
[10:00:12] <btullis>	 !log roll-restarting journal nodes with 30 seconds between each one: `sudo cumin -b 1 -p 100 -s 30 A:hadoop-hdfs-journal 'systemctl restart hadoop-hdfs-journalnode'`
[10:00:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:03:25] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T341042 (10JMeybohm)
[10:05:59] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:13] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[10:06:25] <btullis>	 Uh oh, this looks bad. 
[10:06:46] <jinxer-wm>	 (SystemdUnitFailed) firing: hadoop-hdfs-namenode.service Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:07:19] <btullis>	 https://www.irccloud.com/pastebin/gSSbcgmB/
[10:07:33] <btullis>	 Namenode service has failed over to an-master1002, which is good.
[10:07:37] <joal>	 btullis: could it be a the same failover issue we saw a while back, when we're trying to fail over when too much is done on the sstem?
[10:08:24] <btullis>	 I wasn't trying to do a failover. I was doing a rolling-restart of the journalnodes. The same thing the cookbook does, but with a cumin command.
[10:09:51] <btullis>	 https://www.irccloud.com/pastebin/uLp7HFZd/
[10:09:59] <btullis>	 Definitely related to what I was doing.
[10:10:15] <joal>	 ack!
[10:10:32] <btullis>	 !log btullis@an-master1001:~$ sudo systemctl start hadoop-hdfs-namenode
[10:10:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:10:37] <icinga-wm>	 RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:49] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[10:12:35] <btullis>	 The namenode appears to have started again successfully, so that's good. It looks like I should have given each journalnode more time to settle before roll-restarting the next.
[10:13:07] <btullis>	 I think I have them the same amount of time that they get by default from the cookbook, but I'll double-check.
[10:16:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: hadoop-hdfs-namenode.service Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:56] <btullis>	 Ah, no it was totally my fault. I mis-read the cookbook. The cookbook gives 120 seconds for each journalnode to settle before restarting the next. I took to default 30 second parameter from the yarn resourcemanager restarts.
[10:31:59] <btullis>	 !log beginning hdfs datanode rolling restart with `sudo cumin -b 2 -p 80 -s 120 A:hadoop-worker 'systemctl restart hadoop-hdfs-datanode'`
[10:32:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:33:22] <btullis>	 Actually, I could have used the default `-p 100` there for a threshold of 100% success. I had to use 80% on yarn because of the decommissioning nodes.
[11:01:35] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) Unfortunately, we have to update again because of a new vulnerability announced in the hive-connector. Mentioned in: T336244  We can either stick at 2.6.1 and just regenerate the co...
[11:22:30] <wikibugs>	 (03CR) 10Btullis: "recheck" [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:32:52] <btullis>	 60% of the way through the Hadoop datanode restart - 53 of 87 hosts restarted.
[11:52:50] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[12:15:53] <wikibugs>	 10Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10Stevemunene) I succeeded in creating a jupyterhub single user session. The error was due to a missing directory namely the `/srv/spark-tmp` directory which is part of the [[ https://github.com/wikimedia/operations-pup...
[12:16:00] <wikibugs>	 (03Merged) 10jenkins-bot: Use an updated version of kafka for the datahub kafka-setup image [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935378 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[12:26:47] <btullis>	 The hadoop workers rolling restart has finished.
[12:38:47] <wikibugs>	 10Data-Engineering: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Manuel) @AndrewTavis_WMDE: Yes, going with only `wmde` sounds good to me. Let's make it so!
[12:44:09] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/450  Update airflow to version 2.6.2
[12:51:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:18] <btullis>	 ^ looking
[13:00:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:07:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:13] <joal>	 btullis: Heya-would ou have a minute for me?I'd like to talk about the Spark history-server with you
[13:13:50] <btullis>	 joal. Yes, hang on two ticks... Just pushing something for tomorrow.
[13:13:57] <joal>	 sure - let me know
[13:16:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:57] <btullis>	 joal: OK, all done. To the cave?
[13:18:02] <joal>	 OMW!
[13:21:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:05] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy spark history - https://phabricator.wikimedia.org/T330176 (10JAllemandou)
[13:35:25] <joal>	 Puppet patcheslook good btullis :) All set for tomorrow :)
[13:37:07] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy spark history - https://phabricator.wikimedia.org/T330176 (10BTullis) p:05Low→03High I have been discussing this with @JAllemandou and I can now see how useful this feature would be. We have had the mapreduce history server available, but now that the vast ma...
[13:37:33] <btullis>	 joal: Many thanks.
[13:39:17] <wikibugs>	 10Data-Platform-SRE: Set up Spark SQL Server - https://phabricator.wikimedia.org/T324017 (10BTullis) p:05High→03Low Having discussed this with @JAllemandou, we have decided to lower the priority of this ticket but not decline it. There may still be some benefit to the Spark SQL server, but now that the prest...
[13:40:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:43:46] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (10BTullis) The kafka-setup container is now proceeding pretty well, with a different error now. ` btullis@deploy1002:~$ kubectl logs -f datahub-main-kafka-setup-job-5pqtb  Error: Could not find or l...
[13:46:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:37] <btullis>	 !log roll-restarting the eventgate-analytics-external worker pods in eqiad with: `helmfile -e eqiad --state-values-set roll_restart=1 sync`
[13:55:38] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:01:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:19] <wikibugs>	 (03PS1) 10Jennifer Ebe: T335860 Implement job to transform mediawiki revision_visibility_change Hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/935442
[14:27:41] <btullis>	 It looks like the restart of the pods didn't help with the canary job.
[14:28:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:30:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:38:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:21] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:11] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:41] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:56:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:46] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:11] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:46] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed