[01:08:03] (PS1) Neil Shah-Quinn (WMF): movement_metrics: Add Wikifunctions to queried database groups [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/959367 (https://phabricator.wikimedia.org/T346966)
[02:27:42] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:42] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:34] The kafka-jumbo rolling restart finished yesterday. Each broker is now fully working and the URP (under-replicated partitions) count is down to 0
[08:08:11] Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol)
[08:10:08] !log redeploying eventgate-analytics in staging T336041
[08:10:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:10:13] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:16:27] brouberol: o/ one suggestion - when you deploy services like eventgate etc.. (that are on wikikube), drop a line in #wikimedia-serviceops
[08:16:45] it is not necessary but stuff like eventgate-main may impact job queues etc..
[08:16:51] so they know basically
[08:24:27] noted thanks!
[08:32:24] Data-Engineering, Data-Engineering-Dashiki, Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (TheDJ) >>! In T346890#9184336, @Mayakp.wiki wrote: > This feels like an effect of Chrome's UA reduction where in Phase 5, the device OS was replaced. See Rollout d...
[08:56:00] !log Rerun edit-hourly druid indexation to fix corrupted data file
[08:56:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:29:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[09:39:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[09:43:23] Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol)
[09:45:57] Data-Platform-SRE, Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (brouberol) Open→Resolved
[10:09:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:55:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:58:30] moritzm: as part of https://phabricator.wikimedia.org/T346763, I find myself in need of packaging external open-source tooling as a Debian package, so we can use it in routine kafka operations. Would you have time in the coming days to share knowledge on Debian packaging with a neophyte? Thank you!
[12:05:03] * brouberol is afk for about 1h
[12:11:00] (CR) Milimetric: [V: +2 C: +2] "I got a tentative +1 from Ben in slack, so that's good enough to try and deploy this. I'm going to roll back if something breaks. (Jenki" [analytics/aqs] - https://gerrit.wikimedia.org/r/958945 (https://phabricator.wikimedia.org/T342213) (owner: Milimetric)
[12:12:35] brouberol: yeah, sure thing. we can look into this next week
[12:16:43] (PS1) Milimetric: Update aqs to 69ded27 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/959727
[12:17:01] (CR) Milimetric: [V: +2 C: +2] Update aqs to 69ded27 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/959727 (owner: Milimetric)
[12:23:06] (PS13) Btullis: Update to Superset version 2.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[12:39:43] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) > I'll leave it to @gmodena to test that the settings have been satisfactorily applied before re...
[12:51:23] moritzm: thanks!
[12:53:36] (CR) Milimetric: "This looks great, tried some different spark configs, and this one seems to be the winner:" [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[12:58:52] (PS4) Ladsgroup: Introduce MostTranscludedPages.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738)
[13:12:24] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineer...
[13:42:13] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:42:42] (SystemdUnitFailed) firing: (2) nginx.service Failed on archiva1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:17] PROBLEM - HTTPS on archiva1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva
[13:49:07] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:11] RECOVERY - HTTPS on archiva1002 is OK: SSL OK - Certificate archiva.wikimedia.org valid until 2023-11-29 22:21:23 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva
[13:51:14] Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 2), Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (gmodena)
[13:52:42] (SystemdUnitFailed) firing: (2) nginx.service Failed on archiva1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:12] Data-Engineering, Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (mfossati) @xcollazo , I deleted those old `VariableProperties`.
[14:36:29] Data-Engineering, Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (mfossati) > The DAGs usually start on Thursdays: I'll verify their outputs before closing this bug. Well, seems like they've not s...
[14:40:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:45:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:42] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:36] (CR) Mforns: [V: +2 C: +2] "Thank you Sam, for the awesome script and effort!" [analytics/refinery] - https://gerrit.wikimedia.org/r/755724 (owner: Awight)
[15:02:36] !log deployed aqs 1.0 to enable etags on all endpoints - so far everything looks ok
[15:02:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:12:08] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:20] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:12:42] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1118:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:53] Data-Engineering, Data-Engineering-Dashiki, Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (Milimetric) Ok, so the action here would be to label the data better, and add an annotation for Phase 5 and any other big changes.
[15:16:25] Looking at the logs on an-worker1118, that was caused by a shortage of heap space
[15:16:25] 2023-09-21 15:09:31,751 WARN org.sparkproject.io.netty.channel.AbstractChannelHandlerContext: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
[15:16:25] java.lang.OutOfMemoryError: Java heap space
[15:16:50] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:02] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:11] ^ I've restarted the process. I wonder why systemd didn't do it itself
[15:17:55] ah well, `Restart=no` in the systemd service config
[15:17:56] I don't think that systemd is set to restart it, but puppet would probably have done so within 30 minutes.
[15:18:06] (CR) Mforns: [V: +2] Remove queries for deprecated mobile_apps jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/931959 (https://phabricator.wikimedia.org/T329310) (owner: Mforns)
[15:19:09] If you're on the host, you could look at why it might have failed. I suspect oom-killer but have also seen segfaults in the past.
[15:19:59] https://usercontent.irccloud-cdn.com/file/p6rblSOj/image.png
[15:20:02] https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-worker1118&var-datasource=thanos&var-cluster=analytics
[15:20:32] looking at the logs, I found a java heap space error (cf my previous message)
[15:22:42] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1118:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:25:50] Just a guess at this stage, but I suspect this job, which is running with 58% of the Hadoop cluster resources. https://yarn.wikimedia.org/cluster/app/application_1694521537759_47834
[15:26:18] Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076 (mpopov)
[15:26:45] correction: 58% of queue, 20% of cluster resources.
[15:28:21] (CR) Mforns: [C: +1] "Is Cassandra still loading both clusters?" [analytics/refinery] - https://gerrit.wikimedia.org/r/681682 (https://phabricator.wikimedia.org/T280649) (owner: Joal)
[15:30:06] Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076 (mpopov) @Milimetric and I have a hypothesis that what's happening here is a race condition where the multiple concurrent runs of a DAG are all using the same tem...
[16:00:52] If anyone is up for a bit of python/kafka review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/959162 and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/959720 work together to improve both the runtime and reliability of kafka rolling restarts, by making sure the broker we just restarted is back in full sync before proceeding to
[16:00:52] the next one. Thank you!
[16:23:15] Data-Engineering, Data Products, Wikidata, Wikidata-Query-Service: Publish WDQS JNL files to dumps.wikimedia.org - https://phabricator.wikimedia.org/T344905 (VirginiaPoundstone)
[16:38:31] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) I've tried two more builds, but I'm still finding the same issue. I t...
[16:44:42] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) Ahah! The different version comes from https://repo.anaconda.com/pkgs...
[16:44:44] (CR) Joal: Cleanup cassandra double loading (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/681682 (https://phabricator.wikimedia.org/T280649) (owner: Joal)
[17:12:36] Data-Platform-SRE, observability, Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (bking) Thanks Andrea and Leo! I'm closing this one in favor of T346438 , but will subscribe y'all on that ticket.
[17:12:51] Data-Platform-SRE, observability, Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (bking) Open→Declined
[17:12:54] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (bking)
[17:44:41] (CR) Mforns: [C: +2] Remove unused cassandra module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/940154 (owner: Joal)
[17:59:38] !log Deploy latest DAGs to analytics Airflow instance
[17:59:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:57] (CR) Hghani: "+1 from me" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/959367 (https://phabricator.wikimedia.org/T346966) (owner: Neil Shah-Quinn (WMF))
[18:19:31] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[18:54:37] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (VRiley-WMF) an-master1003 - C 6. U 12. port 09 CableID 3193 an-master1004 - D 8. U 36. port 35 CableID 2013339101850
[19:22:42] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:22:58] (CR) Hghani: [V: +2 C: +2] "Looks good to me" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/959367 (https://phabricator.wikimedia.org/T346966) (owner: Neil Shah-Quinn (WMF))
[19:33:14] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (xcollazo) Interesting! One observation: We lock down the `conda-environment....
[19:54:52] Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (bking)
[19:55:29] Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking) a:bking
[19:59:23] Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking)
[20:04:26] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[21:00:58] Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking)
[21:14:55] Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (bking)
[22:33:20] Data-Engineering, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Movement-Insights, WMDE-FUN-Sprint-2023-09-04: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (Mayakp.wiki) Thanks for confirming @kai.nissen ! I checked our dashboards an...
[23:22:42] (SystemdUnitFailed) firing: cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:58:29] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2): Document data pipeline and data set ownership - https://phabricator.wikimedia.org/T346295 (Ahoelzl) Next step, define temporary ownership of DE pipelines to meet DQ goals and develop platform.