[00:18:37] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:15] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:13] Data-Engineering, Growth-Team, MediaWiki-Recent-changes, Pywikibot, Event-Platform: Truncated JSON data in recent change event stream "/mediawiki/recentchange/1.0.0" - https://phabricator.wikimedia.org/T353855 (Mmnormyle)
[01:42:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:44:58] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:44:46] * brouberol waves good morning!
[08:46:03] (PS3) Aqu: Remove Oozie folder [analytics/refinery] - https://gerrit.wikimedia.org/r/983674 (https://phabricator.wikimedia.org/T336739)
[09:02:45] (PS4) Aqu: Remove Oozie folder [analytics/refinery] - https://gerrit.wikimedia.org/r/983674 (https://phabricator.wikimedia.org/T336739)
[09:02:51] Data-Engineering, Data Pipelines, Epic, Patch-For-Review: Post Oozie -> Airflow migration refactorings - https://phabricator.wikimedia.org/T336739 (CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/570 Move swift_upload.py in refinery
[09:03:29] (PS5) Aqu: Remove Oozie folder [analytics/refinery] - https://gerrit.wikimedia.org/r/983674 (https://phabricator.wikimedia.org/T336739)
[09:19:58] Good morning. Happy Solstice everyone.
[09:26:54] Data-Engineering (Sprint 6): [Data Quality] Finalize Data Quality Metrics Schema - https://phabricator.wikimedia.org/T352683 (gmodena) a:gmodena
[09:31:10] Data-Engineering (Sprint 6): [Data Quality] Finalize Data Quality Metrics Schema - https://phabricator.wikimedia.org/T352683 (gmodena) We figured out and documented a partitioning scheme based on source table partition timestamp. The schema (and DQ approach) is documented in google doc, and can be moved to w...
[09:38:16] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Gehel) a:brouberol
[09:44:58] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:35] (CR) Gmodena: refinery-job: add WebrequestMetrics. (8 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: Gmodena)
[10:29:42] Data-Platform-SRE (23/24 Q3 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (elukey) +1, looks good (IIUC the new estimation are similar from the original ballpark figures, if not...
[11:40:00] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) I've taken a quick stab at creating a chart...
[11:55:18] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/23 Add /app to the PYTH...
[12:03:37] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/23 Add /app to the PYTH...
[12:24:19] (PS25) Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[12:29:18] Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: `wdqs1006.eqiad.wmnet` - wdqs1006.eqiad.wmnet (**FAIL**) - //Unable to find/resolve...
[12:31:30] Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (Volans) @RKemper so the original failure was due to the fact that homer was not yet setup on the new `cumin1002` host and this cookbook was actually requiring a fully functional h...
[13:21:13] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (JMeybohm) Cool. I think we could/should deploy this vi...
[13:32:29] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) Good call, I didn't think of that. Would yo...
[13:36:10] (CR) Sergio Gimeno: Add analytics for Impressions, Success and Abandonment rate for temporary Users (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[13:44:58] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:09] (CR) Sergio Gimeno: "How is this restricted to temporary users? Rather we need to update the code in I51be116eab6a49c968a44e03a54e40d2a2f9550a to only instrume" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[13:54:13] (CR) Sergio Gimeno: [C: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[13:58:57] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) A thought: do we want to enable egress to s...
[13:59:45] (CR) Ottomata: refinery-job: add WebrequestMetrics. (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: Gmodena)
[14:07:10] Data-Platform-SRE: Service implementation for elastic2087-2100 - https://phabricator.wikimedia.org/T353878 (bking)
[14:08:20] Data-Platform-SRE: Service implementation for elastic2087-2100 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=62754cbc-a6d2-4962-8a4a-e2ae09cc8a2c) set by bking@cumin2002 for 18 days, 0:00:00 on 13 host(s) and their services with reason: T352878...
[14:09:00] Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (bking)
[14:09:42] Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f2d76d59-0ff4-4dca-beb1-17a78d59347b) set by bking@cumin2002 for 18 days, 0:00:00 on 10 host(s) and their services with reason: T352878...
[14:21:35] Data-Engineering (Sprint 8), Data Products, serviceops-radar: Use config-master.wikimedia.org/mediawiki.yaml to automatically switch code that depends on active datacenter - https://phabricator.wikimedia.org/T338796 (Ottomata) @xcollazo do we need this anymore now that we've enabled canary events for...
[14:34:25] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (JMeybohm) We want charts to explicitly define the serv...
[14:36:54] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (gmodena) We are using airflow for dq job orchestration. The actual implementation is in {T352685} and {T3...
[15:02:22] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) Alright! I just thought I'd asked.
[15:36:12] !log creating superset and superset-next namespace on dse-k8s for T347710
[15:36:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:36:15] T347710: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710
[15:40:00] Data-Engineering (Sprint 6), Patch-For-Review: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (Antoine_Quhen) In this ticket, I needed to find a way to detect when an Iceberg table has some data in it. This would replace the Hiv...
[15:41:23] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata) How should we layout and name the new stream(s)? Currently, we have `webrequest_text` and `webrequest_upload` topics. Which topic...
[15:41:54] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata) @aqu asked if we should consider making the new webrequest Hive table an Iceberg table. @JAllemandou @xcollazo can/should we do t...
[15:44:06] Data-Engineering, Data-Platform-SRE, Epic, Patch-For-Review: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[15:45:10] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata)
[15:46:11] Data-Engineering, Data-Platform-SRE (2023/24 Q2 Milestone 1): Airflow scheduler monitoring is broken since the most recent deploy - https://phabricator.wikimedia.org/T353806 (BTullis) Open→Resolved This has now been fixed. The reason for it was that the research instance was set to deploy from a...
[15:52:18] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata)
[15:54:43] Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: `wdqs1007.eqiad.wmnet` - wdqs1007.eqiad.wmnet (**FAIL**) - //Missing DNSName in Nebo...
[15:59:10] Data-Engineering, MediaWiki-General, Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Ottomata)
[16:00:40] Data-Engineering, MediaWiki-General, Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Ottomata) After a discussion in Slack, I have changed the suggested implementation to be use medaiwiki-config/do...
[16:01:22] Data-Engineering, MediaWiki-General, Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Ottomata)
[16:01:40] Data-Engineering, MediaWiki-General, Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Ottomata)
[16:06:20] Data-Engineering, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (BTullis) >>! In T351117#9419578, @Ottomata wrote: > To do this migration plan ^, we'd need Kafka jumbo to support 2x webrequest volume while...
[16:10:10] Data-Platform-SRE (2023/24 Q2 Milestone 1): Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: `wdqs1008.eqiad.wmnet` - wdqs1008.eqiad.wmnet (**FAIL**) - //Missing DNSName in Nebo...
[16:11:47] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Fabfur) >>! In T314956#9421549, @Ottomata wrote: > How should we layout and name the new stream(s)? > > Currently, we have `webrequest_text...
[16:29:13] Data-Engineering, Tech-Docs-Team, Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (TBurmeister) Status update: work started on T350914. More details there!
[16:30:24] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata) As for Hive tables. I'm trying to decide how best to do the migration. Perhaps, it would be easiest to keep the existent `wmf.we...
[16:32:16] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata)
[16:33:47] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata) Hm, alternatively, we could just have the raw and refined tables be brand newly named tables and ingestion jobs during the migrati...
[16:36:26] Data-Engineering, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (brouberol) The cluster has enough capacity to accommodate 2x webrequest volume, so 👍 on my end. When we stop writing to the original `webrequ...
[16:36:47] (PS1) Bearloga: content_translation_event: Add more event_source values [schemas/event/secondary] - https://gerrit.wikimedia.org/r/984864 (https://phabricator.wikimedia.org/T353615)
[16:37:49] Data-Engineering, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (BTullis) >>! In T351117#9406760, @Ottomata wrote: > SRE has been working on a [[ https://docs.google.com/document/d/13oZf2aWAUyCtwscAx1PVY3nx...
[16:40:16] Data-Platform-SRE (23/24 Q3 Milestone 1), Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (brouberol) > I'd prefer to have something to review an...
[16:40:31] Data-Engineering, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Ottomata) > Do you have an estimate of the duration for which we'd be dual-writing? I think we hope to get this done in Q3. But YMMV ¯\_(ツ)...
[17:01:10] Data-Engineering, Structured-Data-Backlog (Current Work): NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (MarkTraceur) Open→Resolved
[17:18:50] Data-Engineering (Sprint 8), Data Products, serviceops-radar: Use config-master.wikimedia.org/mediawiki.yaml to automatically switch code that depends on active datacenter - https://phabricator.wikimedia.org/T338796 (xcollazo) >>! In T338796#9421258, @Ottomata wrote: > @xcollazo do we need this anymo...
[17:22:57] Data-Engineering (Sprint 8), Data Products, serviceops-radar: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (xcollazo)
[17:26:46] Data-Engineering, MediaWiki-General, Event-Platform: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 (Ottomata)
[17:44:58] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:33:02] RECOVERY - Check systemd state on an-airflow1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-kerberos@research.service Failed on an-airflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:49:47] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (xcollazo) >>! In T314956#9421992, @Ottomata wrote: > We just had a discussion in DE standup about {T335306}. I'm sure there are many existen...
[21:52:29] Data-Engineering, Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (CodeReviewBot) xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/572 Draft: Make global confs immutable.
[22:07:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:34:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed