[01:16:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:51:41] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [02:01:42] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [02:27:43] (SystemdUnitFailed) firing: (2) cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:39:00] (SystemdUnitCrashLoop) firing: (5) crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [05:54:11] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) Hello, while doing a cleanup of the WMDE scripts/jobs we came across a situation where some of the jobs timers are still running and seem to be doing so successfully. T... [06:28:00] (SystemdUnitFailed) firing: (2) cleanup_tmpdumps.service Failed on dumpsdata1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:43:42] hmm, seems like mirrormaker is at it again [06:44:58] ah, no, wait. It's about the mirrormakers we have _disabled_. I checked, and all others are running smoothly. 
False alarm [06:45:57] Ack brouberol 👍🏾 [06:46:59] I've set a 2w silence [07:36:56] !log deploying mw-page-content-change-enrich on codfw after kafka has finished synchronizing its replicas [07:36:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:38:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_09 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [07:58:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_09 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [08:36:46] ^ I believe that the brief Druid alert above was an effect of loading the new mediawiki_history_reduced_2023_09 snapshot to druid-public. We've seen this before, where we tell druid to load a new snapshot and it complains that it isn't loaded, whilst loading it as quickly as possible from HDFS. We probably need to tweak the alert a little. [08:45:22] brouberol: Didn't you make a ticket recently about adding Kafka discovery records via round-robin DNS or something? I can't find it at the moment. [08:45:44] I didn't make a ticket, mainly suggested it in the OKR draft document [08:46:52] Ah, gotcha, thanks. 
I happened upon this, which was an earlier discussion of the same thing: https://phabricator.wikimedia.org/T213561 [08:48:43] It sort of explains how we got here, but highlights some of the good points that you made as well. [08:50:24] More recently, this has been created, which you might be interested in: https://phabricator.wikimedia.org/T331894 - It's more generic than just kafka, but only relevant for k8s services. There is some cross-over with your suggestion regarding kafka discovery. [08:51:05] Thanks! I'll have a read! [09:06:50] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines, 10Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (10Pginer-WMF) [09:36:50] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10brouberol) Once restarted, the app is able to produce to the `codfw.mediawiki.page_content_change.v1` topic. The throughput has decreased compared to before... [09:37:57] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10brouberol) We will swap leadership between brokers `1002` and `1013`, and assess any impact on the app performance. Initial state: `brouberol@kafka-jumbo1... [09:39:20] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10brouberol) The swap between `1002` and `1013` is done: ` brouberol@kafka-jumbo1010:~/topicmappr/out-files$ kafka reassign-partitions --reassignment-json-fi... 
[09:42:54] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10brouberol) The leadership has been transferred to 1013: ` brouberol@kafka-jumbo1010:~/topicmappr/out-files$ kafka topics --describe --topic codfw.mediawiki... [09:45:29] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) 05Resolved→03Open It's not quite fixed, as it's doing something else annoying. It keep rewriting the fil... [09:50:04] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10brouberol) We now remove broker 1002 from the replicas: ` brouberol@kafka-jumbo1010:~/topicmappr/out-files$ kafka reassign-partitions --reassignment-json-fi... [10:03:53] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) Things to fix: 1) Visiting the root page a 500 is returned (see above), afaics due to https://gerrit.wikime... [10:17:43] (SystemdUnitFailed) firing: (4) druid-broker.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:43] (SystemdUnitFailed) firing: (4) druid-broker.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:23:56] ^ stevemunene - these are expected, aren't they? Are you bringing druid1009 into service at the moment? 
[10:26:28] Yes they are, btullis, though the service seems to be running ok. [10:29:00] stevemunene: ack, maybe it's just an issue with the systemd monitoring of services. I've seen it be quite slow to pick up on changes before, so maybe it will just fix itself in a minute. [10:32:42] Hi btullis and stevemunene - the number of HDFS corrupt blocks has not gone down since it fired last week - anything we should do? [10:35:35] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:43] (SystemdUnitFailed) firing: (5) druid-broker.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:54] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:56] (03PS1) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [10:46:28] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [10:49:54] (03CR) 10Urbanecm: [C: 04-1] "issue: this probably shouldn't be right under analytics. 
i think this would make sense under `analytics/mediawiki/accountcreation`. what d" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [10:51:58] joal: We need to do another rolling restart of the nameservers to correct it. However, it's not actually corrupt blocks. It's a false reading in the prometheus exporter. [10:52:04] o/ joal it could be a false positive from the JMX metric reporting https://www.irccloud.com/pastebin/PcLkgE8Y/ [10:52:21] ack about the false positive - thanks folks [10:52:39] https://www.irccloud.com/pastebin/cegRIZIt/ [10:55:13] However, we can't just silence the alert and forget about it, so we *do* need to do the rolling restart. I'm just a bit busy to give it focus now, and we've got a lot going on with kafka partition moves, druid servers coming online etc. [10:55:29] FTR, kafka partition moves are done for now [10:55:32] no problem btullis - thanks for the heads up :) [10:55:55] I mean, on hold, until we figure out the performance degradation issue joal's been reporting [10:56:03] (03PS2) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [10:56:31] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:01:07] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:01:21] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:33] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:02:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:12:38] (03PS3) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [11:13:02] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:15:23] btullis: WRT the kafka broker decommission: we've already evacuated more topics than I anticipated! 275/1052 (26%) [11:16:35] Great! Was the mw-page-content-change-enrich application the only one that was negatively impacted by the move, as far as we know? Or were there others? [11:17:56] as far as I know and heard, yes [11:18:18] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) I think we might need to add each host to lvs first before it can be fully part of the druid cluster {F37917815} There's no change in hosts 1hr after adding druid1009 to the l... [11:18:40] and now that kafka is back to being in a nominal state, the app is still experiencing slowness. Not to say that this slowness isn't caused by kafka in any way. 
We're still investigating [11:19:03] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:19:17] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:20] brouberol: Oh, OK. Let me know if you'd like any more eyes on it. [11:19:29] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:20:34] right now, thomas is looking at it, as he's the app SME, but sure, I'll loop you in if we don't find anything 👍 [11:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:22:51] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) What about running the `sre.druid.roll-restart-workers` cookbook on this cluster, so that it restarts the processes? I think that this is more likely to make sure that the new host... [11:23:03] brouberol: Great, thanks. 
[11:34:12] (03PS4) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [11:34:38] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:40:03] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:19] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:31] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:47] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:59] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:49:01] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server 
middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:59:46] (03PS5) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [12:00:20] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [12:10:11] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:23] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:11:27] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:43] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:17:53] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:18:57] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:45] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:40:03] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:19] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:53] qq: how can 
I know how many physical hard drives are behind the /dev/mapper/vg1-srv volume of a kafka-jumbo host? [12:46:58] brouberol: Something like `sudo megacli -LDInfo -LAll -aAll` from https://wikitech.wikimedia.org/wiki/MegaCli#Array_status [12:47:27] Oh, hang on. Maybe they're software RAID. Checking now. [12:48:47] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:49:03] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:13] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:49:13] Nope, hardware RAID 10 - Looks like 12 drives, so 6 pairs. [12:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:24] btullis thanks! 
[13:07:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:59] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:09] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:11:13] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:17:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:45] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:19:03] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:13] RECOVERY - Druid historical on 
druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:41] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/1 Create a build script for... 
[13:27:39] !log Manually mark wikidata_item_page_link_weeklywait_for_mediawiki_page_move [13:27:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:27:42] Schedule: 0 0 * * 1 info Next Run: 2023-10-02, 00:00:00 [13:27:42] woops [13:28:09] (03PS6) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [13:28:09] !log Manually mark wikidata_item_page_link_weekly.wait_for_mediawiki_page_move task successful (with note) to overcome datacenter switchover sensor issue [13:28:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:28:37] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [13:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:51] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) >>! In T336042#9215049, @BTullis wrote: > What about running the `sre.druid.roll-restart-workers` cookbook on this cluster, so that it restarts the processes? I think that this... [13:40:20] !log roll-restart druid public workers to pick up a new worker node. 
T336042 [13:40:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:40:23] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [13:47:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:11] (03PS7) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [14:01:39] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [14:01:48] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10CodeReviewBot) tchin merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/50... [14:02:03] ottomata: standup? 
[14:02:43] (SystemdUnitFailed) firing: (3) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:51] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I've gone with the option of making a separate Debian package for each of the yarn shufflers that we wish to have installed... [14:11:21] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10Ottomata) Would it be worth partitioning mediawiki.page_change.v1 too? Just so we can run multiple consume... [14:14:08] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) a:03Jclark-ctr [14:34:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [14:42:44] (SystemdUnitFailed) firing: (3) druid-middlemanager.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:56] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:54:06] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:40] --^ Silencing the druid1009 alerts as we bring the server into service [14:57:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:32] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:03:21] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - 
https://phabricator.wikimedia.org/T336042 (10Stevemunene) druid1009 fails the roll restart at the last stage pooling, since it is not part of the druid public broker VIP yet. ` PASS |██████████████████████████████████████████████████... [15:07:04] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) the `druid-historical.service` errors are related to the fact that the server is not yet part of the zookeper cluster. ` 2023-10-02T14:50:49,454 INFO org.apache.zookeeper.Clie... [15:15:25] 10Data-Engineering, 10serviceops, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (10Gehel) a:03EBernhardson [15:18:34] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:18:46] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:10] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:20:25] hey folks [15:20:34] I have updated https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration#Change_to_mediawiki/services/eventstreams_repository with info about how to test eventstreams [15:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:54] you rock elukey :) [15:23:30] 10Data-Engineering, 
10Anti-Harassment, 10Data Engineering and Event Platform Team, 10Privacy Engineering, and 4 others: Exposing revIDs (nothing more) of deleted/suppressed edits for research to respect their removal - https://phabricator.wikimedia.org/T200559 (10Ottomata) > page deletion This will either... [15:23:55] hello! I'm looking to roll out the aqs2 services that currently use druid (edit-analytics and editor-analytics) and I don't know much about druid - do you need credentials to query druid? I don't see any in AQS1's config so I suspect not but just wanted to be sure [15:24:26] joal: final version in https://stream-beta.wmflabs.org/, seems feature complete (running node 18!) [15:24:56] hnowlan: we don't use TLS for druid :( [15:25:23] 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) [15:27:04] 10Data-Platform-SRE, 10SRE-OnFire, 10Discovery-Search (Current work), 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) [15:28:13] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) https://stream-beta.wmflabs.org/ is up to date with the new version running node 18! A... 
[15:30:34] (03PS8) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:31:08] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [15:38:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [15:41:01] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (10Gehel) [15:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [15:53:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [15:58:19] (03PS9) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [15:58:45] (03CR) 10CI reject: [V: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [16:05:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [16:05:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [16:10:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... 
[16:10:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [16:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [16:14:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [16:14:57] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) [16:15:12] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) [16:19:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... 
[16:19:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [16:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:31:12] Hi team - this alert about the kafka consumer lag is known - we've restarted the job with more resources so that it now backfills - this alert will eventually disappear :) [16:32:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... 
[16:33:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [16:34:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:34] 10Data-Engineering, 10Event-Platform (Sprint 12), 10Patch-For-Review: mediawiki-event-enrichment: issue async requests from ProcessFunction - https://phabricator.wikimedia.org/T332948 (10Ottomata) 05Open→03Resolved Pretty sure this task is done. Resolving. [16:42:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:23] !log Silent the "High Kafka consumer lag for mw_page_content_change_enrich in codfw" alert for 3 days [16:45:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:47:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:52:43] (SystemdUnitFailed) firing: (5) produce_canary_events.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:07:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:31] 10Data-Engineering, 10Data-Platform-SRE: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10Ottomata) Wow very cool! [17:21:02] 10Data-Engineering, 10EventStreams, 10Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (10Ottomata) > `composer buildConfigCache` :o TIL! Nice! Is this just so it doesn't have to be set manually for analytics devs creating streams? I'm... [17:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:28:42] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10Ottomata) a:05gmodena→03Ottomata We should fix: eventgate-wikimedia's reca... 
[17:30:23] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10Ottomata) > if they have any value, or if they're left over cruft. Cruft for sure. Please delete. [17:31:16] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs using in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata) a:03Ottomata [17:36:42] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) WOW thank you Luca! [17:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:51] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (10Ottomata) I think this will be harder than it sounds. I don't think there is a way to automate dynamic deployments of data... 
[17:39:58] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [17:40:04] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [17:40:41] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: services should use common logging schema - https://phabricator.wikimedia.org/T347498 (10Ottomata) Interested. Especially if we spend some time on these for {T347477} [17:40:51] 10Data-Engineering, 10Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10mforns) [17:41:46] 10Data-Engineering, 10Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10mforns) a:05mforns→03None [17:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:49] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: Enum with an entry of `null` should fail jsonschema-tools validation - https://phabricator.wikimedia.org/T344511 (10Ottomata) :) [17:51:48] (03PS1) 10MNeisler: Add the wikifunctions_ui metrics platform schema to the allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/962657 
(https://phabricator.wikimedia.org/T344277) [17:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:29] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10Ottomata) > Some of the kafka topics are remnants of tests and misconfiguration/misnamings. There is an option t... [17:58:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [17:58:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [18:04:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [18:04:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [18:09:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... 
[18:09:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [18:09:57] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [18:09:57] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [18:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:42] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... 
[18:19:42] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [18:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [18:26:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [18:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [18:37:21] 10Data-Engineering, 10Discovery-Search, 10serviceops-radar, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10bking) [18:37:26] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10bking) [18:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:47:58] 10Data-Engineering, 10Data-Platform-SRE: Misconfigured proxies on data-engineering hosts - https://phabricator.wikimedia.org/T326302 (10RKemper) [18:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:13] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed wi... 
[19:00:17] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed wi... [19:02:08] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [19:03:39] (03PS9) 10Sharvaniharan: New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 [19:04:14] (03CR) 10Sharvaniharan: "Thank you Toni... and done!" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [19:11:38] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:12:42] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10Ottomata) Earlier today, @JAllemandou and @tchin restarted the streaming app. Messages are now being produced, so the app is working. However, it was not... [20:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:14:46] 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10Ottomata) I also started an incident report: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-28_mw-page-content-change-enrich @brouberol @JAllemandou... [20:22:22] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed wi... 
[20:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:50] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed wi... [20:32:07] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-en... [20:32:15] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/m... [20:32:53] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) [20:36:09] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) A few notes on this process: - I used [[ https://facebook.github.io/zstd/ | zstd compression ]] to compress the JNL file, as it suppo... 
[20:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:01:46] 10Data-Engineering, 10Pageviews-API, 10Tool-Pageviews: Mediaviews analysis doesn't work for files with non-standard letters in the filename? - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) Well first, the file was only uploaded 22 hours ago, so the data might simply [[ https://pageviews.wmcloud.o... [21:03:49] 10Data-Engineering, 10Pageviews-API, 10Tool-Pageviews: Mediaviews analysis doesn't work for files with non-standard letters in the filename? - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) >>! In T347899#9217165, @Aklapper wrote: > Cannot reproduce. > > I only see an `Uncaught DOMException: The... [21:07:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:05] 10Data-Engineering, 10Pageviews-API, 10Tool-Pageviews: Mediaviews analysis doesn't work for files with non-standard letters in the filename? - https://phabricator.wikimedia.org/T347899 (10Aklapper) > Do you have ad blockers or privacy extensions enabled, by chance? Ah, yes. Sorry for the noise! 
[21:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:43] 10Data-Engineering, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` - https://phabricator.wikimedia.org/T347647 (10dr0ptp4kt) The addshore .jnl (August file)... [21:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:52] (03CR) 10Tsevener: [C: 03+1] New Event schema for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960100 (owner: 10Sharvaniharan) [21:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on 
druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:11] Data-Engineering, Tool-Pageviews: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (Dusan_Krehel) p:Triage→Medium [22:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [22:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:24] Data-Engineering, Tool-Pageviews: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (Dusan_Krehel) [22:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:12:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:22:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:42:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:43] (SystemdUnitFailed) firing: (4) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:57:43] (SystemdUnitFailed) firing: (5) druid-historical.service Failed on druid1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:59:34] PROBLEM - Druid overlord on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid